Preface


It seemed worthwhile to try how far the principle of evolution would throw 
light on some of the more complex problems in the natural history of man. 
Charles Darwin The Descent of Man (1871) 


Ever since Darwin published his The Origin of Species in 1859, the theory of evo- 
lution has been the single most important unifying idea in biology. Although 140 
years have separated the comparison of pigeon breeds from the comparison of 
mammalian gene sequences, the underlying evolutionary questions posed are not 
dissimilar. In the intervening years, evolutionary theory has permeated all the 
natural sciences. Indeed, the study of human evolution has now become so broad 
that it spans biochemistry, anatomy, physiology, psychology and social behavior, 
linguistics, epidemiology, demography, paleontology and of course genetics. It has 
also provided the impetus to forge new alliances between some of these disciplines 
whose practitioners have traditionally shared little common ground. 

Knowledge of our origins, and of our relationship to the rest of the natural 
world, has the potential to enrich a wide range of human activities. Indeed, if ‘the 
proper study of mankind is man’, then it is arguable that the study of human evo- 
lution should be regarded as fundamental to our continuing quest for self-knowl- 
edge. As McConkey and Goodman (1997) put it: 


It is our conviction that comparative analysis of human and ape genomes is far 
more than an excursion into natural history at the molecular level. Until we 
have a detailed understanding of the genetic differences between ourselves 
and our closest evolutionary relatives, we cannot really know what we are. 


The dramatic progress made in human molecular genetics over the last two 
decades has produced a wealth of information on gene structure, mapping and 
expression as well as mutation, polymorphism and comparative genome analysis. 
These data have been successfully superimposed upon the firm foundations of 
evolutionary biology already established by a synthesis of population genetics, 
molecular biology and phylogenetics (Li, 1997). Within the next 5 years, the 
Human Genome Project should yield the entire sequence of the human genome, 
providing us with an unparalleled opportunity to understand the structure and 
function of our genome as well as its evolution (Clark, 1999). To appreciate fully 
the relationship between the structure and function of our genes and the proteins 
they encode, we must be aware of their evolutionary past and their molecular 
ontogeny. Conversely, to have any real understanding of the evolutionary path- 
ways taken, we must also understand gene structure and function. 

The aim in writing this book was to bring together the highly dispersed 
literature on human gene structure, function and expression, to integrate this 
with our emerging knowledge of chromosome and genome structure, and to draw 
comparisons both within and between paralogous and orthologous gene
sequences in order to establish the nature of the mutational mechanisms respon- 
sible for their evolutionary divergence. Such an approach, encompassing a large 
number of human genes and their extended families, is essential for any attempt 
to discern underlying evolutionary principles. 

This volume is divided into three parts. The first, comprising two chapters, is 
intended to serve as an introduction to the structure, function and evolution of 
the human genome. The second, containing four chapters, focuses on the evolu- 
tion of gene structure, organization and expression, the origins of human genes 
and gene inactivation. The third part, containing the remaining four chapters, 
discusses the wide range of mutational mechanisms that have created and fash- 
ioned extant human genes and fine-tuned their structure, function and expres- 
sion. 

A central theme of this volume is that mutations in human gene pathology and 
evolution represent two sides of the same coin in that those same mutational 
mechanisms that have frequently been implicated in human pathology have also 
been involved in potentiating evolutionary change. In order to illustrate this par- 
allelism, a large number of examples of different types of mutational lesion which 
have occurred during molecular evolution have been collated. Regardless of 
whether they are advantageous, disadvantageous or neutral, these mutational 
changes and their putative underlying causal mechanisms are very similar. This 
book therefore constitutes a companion volume to Human Gene Mutation (Cooper 
and Krawczak, 1993), which investigated the causes and consequences of patho- 
logical gene lesions and demonstrated that the nonrandomness of mutation is 
determined largely by the local DNA sequence environment. 

It is now clear that the gene has often been a dynamic entity over evolutionary 
time, not a static one. Many genes have undergone gross rearrangement as a result 
of the action of any one of a number of mutational processes such as insertion, 
inversion, duplication, repeat expansion, translocation or deletion. It turns out 
that even relatively conserved genes do not necessarily always change by a slow 
incremental process of single base-pair substitution; rather, such genes may 
acquire multiple nucleotide substitutions simultaneously by mechanisms such as 
gene conversion. 

Ten years ago, in order to illustrate principles of human gene evolution, it 
would have been necessary to draw extensively on examples from organisms such 
as Drosophila and yeast. In recent years, however, the literature has expanded to 
such an extent that I have been able in most cases to use human genes as examples. 
Where this has not been possible, I have tried to quote examples from other mam- 
mals or, failing that, examples from other vertebrates. Wherever possible, data 
from nonhuman primates have been included in order to place human gene evo- 
lution in its proper context. For genes with greater antiquity, comparisons have 
been made with orthologues from other taxa or even other phyla. This notwith- 
standing, I have tried to confine my treatment of the subject area to the evolution 
of human gene sequences and have declined to stray too far from the gene itself, 
the primary target of mutation and source of hereditary variation. Some topics, 
such as the evolution of the mitochondrial genome, repetitive DNA sequences or 
human populations could easily have had whole volumes devoted specifically to 
them but have of necessity been treated somewhat summarily. Protein evolution
also has a vast literature and I have deliberately not attempted to cover this topic 
in any detail here. There are numerous texts that discuss the evolution of protein 
folds, protein sequences and the relationship between protein structure and func- 
tion; the interested reader is referred to these specialized texts for detailed infor- 
mation. 

Every effort has been made to name the human genes cited according to their 
encoded products and also to specify their symbols (as currently recommended by 
the Human Gene Nomenclature Committee). In addition, their chromosomal loca- 
tion has been provided wherever possible, not only because of its potential func- 
tional significance but also to avoid ambiguity if and when gene symbols are 
altered. Altogether, more than 1600 different human genes are referred to specifi- 
cally in the text and these have been listed in separate indices. 

Despite the recent rapid advances in molecular evolutionary genetics, I have 
been conscious in places of being able only to describe or list, without being able 
to discern any underlying principle or illuminate a question by providing an 
explanatory mechanism. I am also acutely aware of the danger of supposing that 
we understand a given process when in fact all that we have done is to have col- 
lated the basic information necessary for us to derive explanations. As Morgan 
put it: 


When the biologist thinks of animals and plants, ... he runs the risk of think- 
ing that he is explaining evolution when he is only describing it. 
T.H. Morgan Evolution and Genetics (1925) 


It is of course almost impossible not to view evolution from a human perspective, 
that of the only organism able to contemplate its own origins and perhaps its own 
demise. Indeed, Haldane was aware of our tendency to anthropomorphize: 


I have been using such words as ‘progress’, ‘advance’, and ‘degeneration’, as I 
think one must in such a discussion, but I am well aware that such terminol- 
ogy represents rather a tendency of man to pat himself on the back than any 
clear scientific thinking. The change from monkey to man might well seem a 
change for the worse to a monkey ... we must remember that when we speak 
of progress in evolution we are already leaving the relatively firm ground of 
scientific objectivity for the shifting morass of human values. 

J.B.S. Haldane The Causes of Evolution (1932) 


At the heart of the way in which we conceptualize the process of evolution is an 
apparent dichotomy: on the one hand, sequence conservation is usually held to 
imply functionality, on the other, the emergence of novel functions implies 
change. Thus, selection may act conservatively so as to retain features of struc- 
tural or functional importance (negative or purifying selection), or act so as to 
favor changes that confer some advantageous characteristic (positive selection). 
Evolution can, however, also proceed in a stochastic neutralist fashion, some fea- 
tures being adopted not necessarily because of any selective advantage accruing to 
the organism, but rather owing to the vagaries of population size, structure or 
dynamics over an extended period of time. Further, it is now clear that some DNA 
sequences either possess or acquire a mutational momentum of their own and, in
the absence of negative selection, will drive the process of evolutionary change 
without the necessary involvement of selection. In practice, these diverse mecha- 
nisms are acting in different combinations and permutations at many loci simul- 
taneously. Evolution has no foresight, however, and the ultimate arbiter of 
evolutionary success is always the number of surviving, reproducing, offspring. 

In the years ahead, one of the major challenges will be to determine how we dif- 
fer from our closest living relatives, the chimpanzees and the other great apes; to 
explain in genetic terms the differences, not only in anatomy but also in the intellect
and, in particular, language, that have come about over the last 5-7 million
years since the divergence of the human and chimpanzee lineages. It appears 
unlikely that such major phenotypic differences can be explained simply in terms 
of incremental changes in the structure of specific genes leading to relatively sub- 
tle changes in protein structure and function. Instead, it is anticipated that we 
shall need to locate key changes in regulatory genes that have served to alter the 
tissue and developmental specificity of gene expression, or the expression of mul- 
tiple downstream genes, gene pathways and even ultimately gene networks. 

The story of our evolutionary past, written in the coded language of DNA, is 
told in our genome sequence. We have learned how to access it, we shall soon be 
able to read it, but the most demanding task of all still remains, that of interpret- 
ing it. 
Structure and function in
the human genome 


1.1 Introduction 


Nothing in biology makes sense except in the light of evolution. 
Theodosius Dobzhansky (1973) 


The haploid human genome comprises 23 chromosomes containing between 
them some 3.2 × 10⁹ bp of DNA. The bulk of the genome comprises DNA
sequence of varying degrees of repetitivity whilst about 10% represents unique 
(single copy) sequence containing perhaps 70 000 genes. The repetitive portion 
contains a mixture of DNA sequence elements which may have a structural or 
regulatory role or may be merely ‘junk DNA’ without obvious function 
(Zuckerkandl, 1997). Human genes contain within them all the genetic informa- 
tion necessary to specify the encoded proteins but they also contain further infor- 
mation, in the form of a large number of different DNA sequence motifs that 
serve to control mRNA expression, splicing, transport, and stability. This chapter 
constitutes a short introduction to the structure, function and expression of the 
human genome together with a brief description of the types of mutational lesion 
in human genes that have been found to be responsible for inherited disease. 


1.1.1 Chromosome structure and function


DNA plays a role in life rather like that played by the telephone directory in 
the social life of London: you can’t do anything much without it, but, having 
it, you need a lot of other things — telephones, wires, and so on — as well. 

C.H. Waddington (1968) 


Chromatin structure. Human chromosomes contain DNA in a highly coiled 
and condensed form, organized and packaged by structures known as nucleo- 
somes. Chains of nucleosomes comprise a ‘10 nm fibre’ and this is coiled to form 
the ‘30 nm fibre’ which is in turn further coiled to form chromatin (reviewed by 
Kornberg and Lorch, 1992; Paranjape et al., 1994; Wolffe, 1992). The highest 
degree of condensation is found in transcriptionally inactive regions, those 
regions in which mRNA synthesis from specific genes is turned off. To allow 
transcriptional activation of a gene (i.e. to allow the enzyme RNA polymerase to 
initiate mRNA synthesis), chromatin must be uncoiled, a process which occurs
in at least three stages: the unfolding of large chromosomal domains (25-100 kb), 
the remodeling of the chromatin structure of gene regulatory regions and the 
alteration of nucleosome structure in transcribed regions (Jackson, 1997). 
Unfolding reveals binding sites on the chromosomal DNA for activator proteins. 
Once bound, these proteins alter nucleosome positioning and reveal further 
binding sites for activator proteins. Decondensed chromatin thus provides an 
accessible template for the assembly of the transcriptional initiation complex. 


Centromeres. The centromere is essential for normal disjunction of the chromo- 
somes following cellular division at meiosis and mitosis. Centromeric DNA con- 
sists of arrays of tandemly repeated DNA sequences which have undergone 
homogenization by repeated genetic exchanges (Lee et al., 1997; Warburton et al., 
1993). Alphoid satellite DNA is the only satellite DNA family known to be pre- 
sent in the centromeric regions of all human chromosomes. The basic 169 and 172 
bp repeats of this primate-specific satellite DNA comprise the bulk of human cen- 
tromeric DNA and contain a 17 bp binding site for the centromere-specific pro- 
tein CENP-B. This binding site motif is present at the centromeres in the 
chromosomes of all the anthropoid apes but is absent from the genomes of Old 
and New World monkeys and prosimians (Haaf et al., 1995). It is also absent from 
human Y chromosomal alphoid satellite DNA (Jørgensen, 1997). 

Human alphoid satellite DNA may be divided into subsets which are largely 
chromosome-specific (Jørgensen, 1997). They may however be grouped into
four supra-chromosomal families each characterized by a specific monomer
type (Jørgensen, 1997). Homologies exist between alphoid satellite DNAs from
different ape species but these are not associated with homologous chromo- 
somes (Samonte et al., 1997; Warburton et al., 1996). Although alphoid satellite 
repeats may have evolved from a common ancestral repeat monomer (Haaf and 
Willard, 1998), they have also been subject to concerted evolution between 
homologous chromosomes within a given species (Haaf and Willard, 1997; 
Jørgensen et al., 1992). 

Using a combination of oligonucleotide primer extension and immunocyto- 
chemistry, Mitchell et al. (1992) showed that the alphoid repeats were closely asso- 
ciated with the kinetochore (the structural element on the chromosome that binds 
to the mitotic spindle). The presence of (AATGG)n (CCATT)n repeats in the cen- 
tromeric region suggests that stem-loop structures might form which could serve 
as specific recognition sites for kinetochore function (Catasti et al., 1994). Alphoid 
satellite DNA sequences are not however restricted to centromeric regions 
(Baldini et al., 1993). 


Telomeres. Telomeres allow the end of the chromosomal DNA to be replicated 
completely without the loss of bases at the termini (reviewed by Blackburn, 1994; 
Gilson et al., 1993). They are the sites at which the pairing of homologous chro- 
mosomes is initiated and in humans contain long arrays (averaging about 10-15 
kb) of minisatellite DNA comprising tandem hexanucleotide repeats, most fre- 
quently TTAGGG (Brown, 1989). Other telomeric hexanucleotide repeats (e.g. 
TTGGGG, TGAGGG) are also known (Allshire et al., 1989; Brown, 1989). These 
sequences and their variants are tandemly repeated but are nonrandomly distributed
and polymorphic in terms of their location. Chimpanzees also possess the
TTAGGG telomeric repeat (Luke and Verma, 1993) but differ from humans in 
terms of their subterminal satellite sequences (Royle et al., 1994). 
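The tandem organization of the telomeric hexanucleotide repeats can be illustrated with a short sketch. The function name `longest_tandem_run` and the toy sequence are hypothetical, introduced only for illustration; the code simply counts the longest uninterrupted run of a given repeat unit and makes no claim about how telomere length is measured experimentally.

```python
import re

def longest_tandem_run(seq: str, unit: str = "TTAGGG") -> int:
    """Return the largest number of back-to-back copies of `unit` in `seq`."""
    # One or more consecutive copies of the repeat unit.
    pattern = re.compile(f"(?:{re.escape(unit)})+")
    runs = [len(m.group(0)) // len(unit) for m in pattern.finditer(seq.upper())]
    return max(runs, default=0)

# Toy sequence (an assumption, not real data): four canonical repeats,
# one variant repeat, then two more canonical repeats.
toy = "ACGT" + "TTAGGG" * 4 + "TTGGGG" + "TTAGGG" * 2 + "CA"

print(longest_tandem_run(toy))            # canonical TTAGGG repeat
print(longest_tandem_run(toy, "TTGGGG"))  # variant repeat
```

A single variant repeat interrupts the canonical array, so the two flanking TTAGGG runs are counted separately.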

The simple sequence of telomeres is synthesized by the ribonucleoprotein poly- 
merase, telomerase (Blackburn, 1992), and is thought to protect the ends of chromo- 
somes from degradation during DNA replication. Telomere length decreases with 
age and number of cell divisions. Some chromosomes (e.g. 9p, 12p, 14p, 17p, 21p 
and 9q) have telomeres shorter than the average whereas others (e.g. 4q, 5p, 18q and 
Xp) have telomeres which are longer (Martens et al., 1998). Telomere length poly- 
morphisms are apparent for some chromosomes, for example 11p (Martens et al., 
1998). 


Sites of recombination. Chiasmata represent the cytological evidence for 
recombination. At meiosis, each pair of homologous chromosomes possesses at 
least one chiasma and the number of chiasmata per pair is proportional to the size
of the chromosome. Chiasma frequency is a function both of distance from the 
centromere (Laurie and Hulten, 1985) and chromosome band ‘flavor’: the G+C 
rich T bands exhibit a six-fold higher chiasma frequency than G bands 
(Holmquist, 1992). Since G+C content is in general positively correlated with
chiasma frequency (Eyre-Walker, 1993), this could explain why gene density is
higher in T bands (see section 1.1.1, Gene distribution and density). 

Recombination is considerably higher in human females than in males, as evi- 
denced by the average distance between two markers in the female genetic map 
being 85% longer than in males. Chiasma frequency is a function of distance from 
the centromere (Laurie and Hulten, 1985). Recombination therefore tends to 
increase towards the telomeres, the distal 15% of chromosomes containing 40% of 
the chiasmata. Finally, it may also be pertinent to consider that intrachromosomal 
homologous recombination may be enhanced by transcription in mammalian 
cells (Nickoloff, 1992). 

Obligatory recombination occurs during male meiosis within the pseudoauto- 
somal region, a 2.6 Mb stretch of homologous sequence at the tip of the short arms 
of the X and Y chromosomes (Petit et al., 1988; Ellis and Goodfellow, 1989). Genes 
in this region escape X-inactivation and the boundaries of this region appear to 
have been conserved evolutionarily between Old World monkeys and human 
(Ellis et al., 1990). 

Regions of sex-specific hypo- and hyper-recombination have been reported in a 
study which compared genetic and physical maps of human chromosome 19 
(Mohrenweiser et al., 1998). Other recombination hotspots have been character- 
ized in specific human genes including those encoding the T-cell receptor beta 
chain (TCRB; 7q35; Seboun et al., 1993) and HLA-associated ATP transporter 2 
(TAP2; 6p21.3; Cullen et al., 1995) loci. Several different types of DNA sequence 
have been proposed to be recombinational hotspots in the genomes of mice and 
men. (CAGA)n and (CAGG)n represent hotspots of recombination in the murine
MHC gene cluster (Steinmetz, 1987). Other sequences thought to promote recom- 
binational instability are alphoid repeats (Heartline et al., 1988), a mariner transpo- 
son-like element (Reiter et al., 1996), Z-DNA (‘left-handed DNA’; Wahls et al., 
1990a) and minisatellite sequences (Chandley and Mitchell, 1988; Wahls et al.,
1990b). Recombinational breakpoints have also been found to be associated with 
topoisomerase I cleavage sites in the rat genome (Bullock et al., 1985). The major- 
ity of these cleavage sites contain the sequences CTT and GTT. It may therefore be 
that the process of nonhomologous recombination is mediated by topoisomerase I. 


Gene distribution and density. Several thousand genes have now been mapped 
to within single chromosome bands. Some 80% map to the G+C rich R bands
whilst 20% map to G bands (Bickmore and Sumner, 1989; Craig and Bickmore, 
1993). A similar distribution is apparent for CpG islands (see Section 1.1.1, CpG 
islands): 86% are located in R bands (Craig and Bickmore, 1994; Larsen et al., 
1992). ‘Housekeeping’ genes are strictly confined to the R bands together with 
about half of the tissue-specific genes (Holmquist, 1992) whereas the remainder of 
the tissue-specific genes are present in the G bands. One of the four recognized 
types or ‘flavors’ of R band, known as T bands, are often found at telomeres, 
exhibit the highest G+C content and contain between 58% and 68% of R band 
genes as well as the majority of CpG islands (Collins et al., 1996; Holmquist, 
1992). 

Chromosomes 13 and 18 appear to possess a relatively low gene density and 
chromosome 19 a relatively high density as evidenced by the chromosomal 
assignment of some 320 cDNAs derived from a human brain cDNA library 
(Polymeropoulos et al., 1993). Interestingly, DNA excision repair may be prefer- 
entially directed toward regions of high gene density (Surrales et al., 1997), a 
reflection perhaps of the preferential repair of actively transcribed gene 
sequences. 


Isochores. The human genome is a mosaic of large (>300 kb) DNA segments or
isochores that are compositionally homogeneous and which can be subdivided 
into a small number of families characterized by different degrees of GC-richness 
(30-60%) (Bernardi et al., 1993a,b). Five families have been identified: L1 and L2 
which are GC-poor and comprise 62% of the genome and H1, H2, and H3 which 
are GC-rich and represent 22%, 9%, and 3% of the genome respectively. Gene con- 
centration varies between isochores: 34% of human genes are located in L1 and 
L2 isochores, 38% in the H1 and H2 isochores, and 28% in the H3 isochores 
(Mouchiroud et al., 1991; Saccone et al., 1996). The banding pattern of chromo- 
somes reflects the isochore organization: thus, the G bands are formed by L1 and 
L2 isochores whilst the T bands are formed by the H2 and H3 families. A recent 
study of two primate globin pseudogenes which reside in different isochore com- 
partments has provided evidence that isochores have arisen as a result of muta- 
tional bias rather than from the action of selection (Francino and Ochman, 1999). 
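The uneven gene concentration between isochore families follows directly from the percentages quoted above. The sketch below divides each family's share of genes by its share of the genome; the variable and function names are illustrative assumptions, and the H1 and H2 genome shares are summed because the gene share is quoted for the two families together.

```python
# Shares quoted in the text: (percent of genome, percent of genes).
isochores = {
    "L1+L2": (62, 34),
    "H1+H2": (22 + 9, 38),  # H1 (22%) and H2 (9%) combined
    "H3": (3, 28),
}

def relative_gene_density(genome_pct: float, gene_pct: float) -> float:
    """Gene share divided by genome share; 1.0 means the genome-wide average."""
    return gene_pct / genome_pct

density = {name: relative_gene_density(*shares) for name, shares in isochores.items()}

for name, value in density.items():
    print(f"{name}: {value:.2f} x genome-wide average")

# How much more gene-dense H3 is than the GC-poor L1 and L2 isochores.
fold = density["H3"] / density["L1+L2"]
print(f"H3 versus L1+L2: about {fold:.0f}-fold")
```

On these figures the H3 isochores, although only 3% of the genome, carry genes at roughly nine times the genome-wide average density.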


Matrix attachment regions. Chromatin is attached to the nuclear matrix or scaf- 
fold at specific sites known as matrix or scaffold attachment regions 
(MARs/SARs). The organization of chromatin with respect to the nuclear scaffold 
is thought to determine chromosome architecture in terms of its functional 
domains; this in turn influences gene activity (reviewed by Dillon and Grosveld, 
1994 and Walter et al., 1998). Indeed, MARs may function so as to place genes at 
the nuclear scaffold in order to facilitate their transcription. MARs do not appear
to share extensive sequence homology but often comprise 200 bp of AT-rich
DNA, for example the sequence AATATTTTT in the murine immunoglobulin κ
gene locus (Cockerill and Garrard, 1986). A number of MAR-binding proteins are 
known to bind to MARs including the attachment region-binding protein, ARBP 
and histone H1. 

MARs appear to be preferentially associated with topoisomerase II cleavage 
sites (reviewed by Laemmli et al., 1992) and share sequence homology with bind- 
ing sites for homeobox proteins (Boulikas, 1992). Topoisomerase II plays a role in 
the segregation of daughter chromosomes after DNA replication and also in 
chromosome condensation; it binds preferentially to MARs (Adachi et al., 1989). 
Vertebrate topoisomerase II cleavage sites also occur in association with MARs 
and manifest a consensus sequence, A/G N T/C N N C N N G T/C N G G/T T N
T/C N T/C (Spitzner and Muller, 1988).

MARs also appear to be preferentially associated with enhancer-type elements.
Indeed, MARs stimulate heterologous gene expression in reporter gene experiments.
A cis-acting regulatory element 3’ to the human Aγ-globin (HBG1) gene, known to be
associated with the nuclear matrix, has been shown to bind specifically to an AT-rich 
binding protein (SATB1) that binds to MARs (Cunningham et al., 1994). 


Origins of DNA replication. Chromosomal DNA replication initiates at specific 
points (origins) and proceeds outward bidirectionally from specific loci. Although 
a number of putative origins of replication have been identified in mammalian 
species (reviewed by Coverley and Laskey, 1994; De Pamphilis, 1993), data are 
still sparse. One of the best characterized is that found in human between the δ-
(HBD) and β-globin (HBB) genes on chromosome 11 (Kitsberg et al., 1993a).
This replication origin is bidirectional and functional regardless of the transcriptional
state of the β-globin gene. From the study of six putative origins of replication
(including one in the human c-myc oncogene), Dobbs et al. (1994) claimed to
have derived a consensus sequence, albeit a fairly redundant one: A/T A A/T T T 
A/G/T A/G/T A/T A/T A/T A/G/T A/C/T A/T G A/T A/C/T A/C A A/T T T. 
However, there are probably several different classes of replication origins which 
possess different sequence characteristics (Boulikas, 1996). 
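A degenerate consensus written in this slash notation can be turned into a search pattern quite mechanically. The following is a minimal sketch, assuming that each whitespace-separated token is one position and that `N` stands for any base; the helper names `consensus_to_regex` and `find_sites` and the test sequence are hypothetical. Because the consensus is so redundant, matches in real sequence would be expected by chance and should not be read as evidence of a functional origin.

```python
import re

# Consensus of Dobbs et al. (1994) as quoted in the text, one token per position.
DOBBS_CONSENSUS = (
    "A/T A A/T T T A/G/T A/G/T A/T A/T A/T A/G/T "
    "A/C/T A/T G A/T A/C/T A/C A A/T T T"
)

def consensus_to_regex(consensus: str) -> str:
    """Translate slash notation into a regular-expression string."""
    parts = []
    for token in consensus.split():
        bases = token.split("/")
        if bases == ["N"]:
            parts.append("[ACGT]")                    # any base
        elif len(bases) == 1:
            parts.append(bases[0])                    # invariant position
        else:
            parts.append("[" + "".join(bases) + "]")  # alternative bases
    return "".join(parts)

def find_sites(seq: str, consensus: str) -> list:
    """Return 0-based start positions of all matches, overlapping ones included."""
    lookahead = re.compile(f"(?=({consensus_to_regex(consensus)}))")
    return [m.start() for m in lookahead.finditer(seq.upper())]

# Invented 21 bp sequence that satisfies every position of the consensus.
site = "AAATT" + "A" * 8 + "G" + "A" * 5 + "TT"
print(consensus_to_regex(DOBBS_CONSENSUS))
print(find_sites("GG" + site + "CC", DOBBS_CONSENSUS))
```

The same helper accepts any consensus in this notation, including ones that contain `N` positions.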

Replication origins are often associated with CpG-rich regions (Delgado et al., 
1998; Rein et al., 1997; Tasheva and Roufa, 1995) and may sometimes be located in 
the vicinity of matrix attachment regions (Section 1.1.1, Matrix attachment regions) 
(Lagarkova et al., 1998). In the human genome, the units of DNA replication range 
in size from 50 kb to 600 kb and are often clustered (Hand, 1978). For instance, 
there are at least six such replicons within the human dystrophin (DMD; Xp21) gene 
(Verbovaia and Razin, 1997). The gene-rich R bands replicate early in S phase 
whilst the G bands replicate late. Housekeeping genes invariably replicate early 
whilst tissue-specific genes can be early or late replicating (some replicate earlier 
when transcriptionally active) (Goldman et al., 1984; Hatton et al., 1988). 
Nontranscribed genes on the X chromosome also replicate late (Torchia et al., 1994). 


1.1.2 Gene organization and transcriptional regulation


Gene structure and regulation. The coding portion of the human genome, 
roughly 5% of the total DNA complement, probably contains some 70 000 
different gene sequences (Fields et al., 1994). Thus, the smallest human chromosome,
21, may contain as many as 2000 genes.

It is now known that most genes in higher organisms are not contiguous but 
rather are a complex mosaic of protein coding (exon) and intervening non-coding 
(intron) sequences (Figure 1.1). The exons represent that portion of the gene which 
encodes the amino acid sequence of the protein product plus 5’ and 3’ noncoding 
regions. Initially, both exons and introns are transcribed into mRNA but the 
intronic portion is ultimately removed during mRNA maturation by a process 
known as splicing (see section 1.1.2, Sequence motifs involved in mRNA splicing and 
processing). The mature mRNA is then translated into the amino acid sequence of a 
protein on the ribosome. Although the central dogma of molecular biology was 
therefore once summarized as ‘DNA makes RNA makes protein,’ the reverse flow 
of genetic information is also possible by reverse transcription of RNA into DNA 
(copy DNA or cDNA). 

Each individual gene differs not only with respect to its DNA sequence speci- 
fying the amino acid sequence of the protein it encodes, but also with respect to 
its structure. A few human genes are devoid of introns (e.g. thrombomodulin 
(THBD) which spans 3.7 kb) whereas others may possess a considerable number, 
for example 79 in the 2.4 Mb dystrophin (DMD) gene and as many as 118 in the 
α1(VII) collagen (COL7A1) gene (Christiano et al., 1994). Introns may be classified
according to whether they interrupt the reading frame of the encoded protein.
Thus phase 0 denotes that the intron lies between two codons, phase 1
between the first and second nucleotides of a codon, and phase 2 between the 
second and third nucleotides of a codon. 
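Intron phase as defined here is simply the number of coding nucleotides upstream of the intron taken modulo 3. The sketch below makes that explicit; the function names and the exon lengths are invented for illustration, and the lengths are assumed to count coding nucleotides only (i.e. excluding any 5’ untranslated sequence).

```python
from itertools import accumulate

def intron_phase(coding_nt_upstream: int) -> int:
    """Phase of an intron preceded by the given number of coding nucleotides.

    0: the intron lies between two codons
    1: between the first and second nucleotides of a codon
    2: between the second and third nucleotides of a codon
    """
    if coding_nt_upstream < 0:
        raise ValueError("nucleotide count cannot be negative")
    return coding_nt_upstream % 3

def intron_phases(coding_exon_lengths: list) -> list:
    """Phases of the introns that separate consecutive coding exons."""
    # Cumulative coding length at the end of every exon except the last.
    boundaries = list(accumulate(coding_exon_lengths))[:-1]
    return [intron_phase(n) for n in boundaries]

# Hypothetical gene with four coding exons (lengths in nucleotides).
print(intron_phases([91, 100, 61, 48]))
```

An exon flanked by two introns of the same phase has a length divisible by three, so its loss or duplication leaves the downstream reading frame intact.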

Some introns may be huge as in the case of the first intron of the human 
COL5A1 gene (~600 kb; Takahara et al., 1995). The average length of a vertebrate
intron has been estimated to be ~620 bp (Hawkins, 1988) but introns separating 
exons preceding the coding ones are often rather longer with an average length of 
>1800 bp (Hawkins, 1988). This suggests that evolution may sometimes have had 
to trawl quite far upstream of a gene to recruit appropriate DNA sequence motifs 
to act as promoter/regulatory elements within the 5’ untranslated region. The 
average length of an internal exon is ~140 bp (Hawkins, 1988) but this average 
figure conceals some very large exons, for example in the human factor VIII (F8C; 
Xq28) [3106 bp], apolipoprotein B (APOB; 2p23-p24) [7572 bp] and mucin 5B 
(MUC5B; 11p15.5) [10 690 bp] genes.


[Figure 1.1 diagram: 5’ flanking region with GC, CAAT and TATA boxes; transcriptional initiation site; exons 1-3 separated by introns I and II, each intron beginning with GT and ending with AG; ATG initiation codon; stop codon; poly(A)-addition site (AATAAA); 3’ flanking region.]

Figure 1.1 Schematic structure of an archetypal human protein-coding gene. UTR, 
untranslated region. 
Some human genes encode very large proteins, for example the cardiac titin
(TTN; 2q31) cDNA is 82 kb in length and predicts a 27 000 amino acid protein 
with a molecular weight of nearly 3000 kDa (Labeit and Kolmerer, 1995). Large 
mRNAs are also generated from the dystrophin (DMD; Xp21.2-21.3; 14 kb), 
apolipoprotein B (APOB; 2p23-24; 14 kb), and mucin (membrane-associated glycoprotein)
genes MUC2, MUC5AC, MUC5B, MUC6 (11p15.5; 15-18 kb), MUC3
(7q22; 17 kb), and MUC4 (3q24; 18-24 kb) (Debailleul et al., 1998). 
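The figures quoted for titin are mutually consistent, as a rough back-of-the-envelope check shows. The sketch assumes three nucleotides per amino acid plus one stop codon, and an average residue mass of about 110 Da; the latter is a commonly used approximation and not a value taken from the text, and the function names are illustrative.

```python
AVG_RESIDUE_MASS_DA = 110  # assumed average mass of one amino acid residue

def coding_length_nt(n_residues: int, include_stop: bool = True) -> int:
    """Nucleotides needed to encode a protein of n_residues amino acids."""
    return 3 * n_residues + (3 if include_stop else 0)

def approx_mass_kda(n_residues: int) -> float:
    """Approximate protein mass in kDa from the residue count."""
    return n_residues * AVG_RESIDUE_MASS_DA / 1000

titin_residues = 27_000  # figure quoted in the text
print(coding_length_nt(titin_residues))  # coding sequence length in nucleotides
print(approx_mass_kda(titin_residues))   # approximate mass in kDa
```

A coding sequence of about 81 kb fits within the 82 kb cDNA, and the estimated mass agrees with the quoted figure of nearly 3000 kDa.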

Only ~10% of human genes encode proteins of known function. The 
sequences of many of these genes can be found in GenBank 
(http://www.ncbi.nlm.nih.gov/). In addition, hundreds of thousands of 
‘expressed sequence tags’ (ESTs; Gerhold and Caskey, 1996) have been characterized
which together represent a nonredundant set of >45 000 human genes
(UniGene; http://www.ncbi.nlm.nih.gov/UniGene/index.html).

Precisely what constitutes a gene is somewhat contentious (Epp, 1997) but a 
crude working definition might be: a transcription unit plus associated regulatory 
sequences which together serve to specify both the sequence and the expression pattern of a 
protein product. The term ‘gene’ cannot, however, be restricted to protein coding 
sequences since some genes (e.g. snRNA, rRNA, tRNA, XIST, H19, IPW) encode 
RNA molecules with a variety of biological functions and which are not translated 
into protein. A simple universally applicable definition of a gene is difficult to 
derive owing to the existence of exceptions to almost any rule that one might 
devise. Thus, some transcription units may encode multiple unrelated proteins 
with different functions as a result of alternative splicing (see section 1.1.2, Sequence 
motifs involved in mRNA splicing and processing; Figure 3.1 in Chapter 3). As a con- 
sequence, the notion of a gene becomes somewhat elastic. Some genes occur 
within the introns of other genes, for example OMG, EVI2A, EVI2B within the 
neurofibromatosis type 1 (NF1; 17q11.2; Viskochil et al., 1991) gene, F8A within 
the factor VIII (F8C; Xq28) gene (Levinson et al., 1990), and U21 within the L5 
genes of chickens and mammals (Qu et al., 1994). The genes of most known 
vertebrate small nucleolar RNAs (snoRNAs) are located within the introns of other 
genes (Maxwell and Fournier 1995), two human examples being the RNE1 and 
RNE2 genes which are located within the mitotic regulator (CHC1; 1p36.1) and 
the 67 kDa laminin receptor (LAMR1; 3p21.3) genes, respectively. The 
realization that some genes reside within the introns of other genes makes the 
concept of the gene that much more diffuse. 

Human genes can also overlap in a number of different ways. Thus, two genes 
encoding erbA homologues, ear-1 (THRAL) and ear-7 (THRA), located at the 
same locus on chromosome 17, possess overlapping exons but are transcribed 
from opposite DNA strands (Miyajima et al., 1989). Similarly, the tenascin-X 
(TNXA; 6p21.3) gene overlaps with the last exon of the cytochrome P450 
(CYP21; 6p21.3) gene on the opposite DNA strand (Speek et al., 1996). Other 
examples of overlapping human genes transcribed from opposite DNA strands 
are provided by the PMS2 (7p22) gene and a gene encoding a 34.5 kDa 
polypeptide (Nicolaides et al., 1995), the CD3 ζ/η/θ (CD3Z) and Oct1 transcription 
factor (POU2F1) genes on 1q22-q23 (Lerner et al., 1993), and the cytochrome c 
oxidase subunit X (COX10) gene and a partially characterized cDNA (C17ORF1) on 
chromosome 17p12-p11.2 (Kennerson et al., 1997). An example of overlapping 
human genes transcribed from the same strand is provided by the prematurely 
termed growth hormone gene-derived transcriptional activator (GHDTA) gene 
and the growth hormone 1 (GH1; 17q22-q24) gene (Labarrière et al., 1995). The 
GHDTA gene is transcribed from position -197 in the GH1 gene promoter and 
contains an open reading frame that extends from the ATG at -151 to a Stop 
codon in exon 2 of the GH1 gene. Another kind of overlap is exemplified by the 
transglutaminase 1 (TGM1; 14q11.2) gene some of whose cis-acting regulatory 
sequences reside within the transcribed portion of the functionally unrelated Rab 
geranylgeranyl transferase α subunit (RABGGTA; 14q11.2) gene located <2 kb 
upstream of the TGM1 transcriptional initiation site (van Bokhoven et al., 1996). 
Clearly, the existence of overlapping genes can hamper any attempt to demarcate 
precisely and unambiguously where one gene ends and another begins. 
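
The different arrangements described above (nested genes, overlap on the same strand, overlap on opposite strands) can be distinguished mechanically once each gene is reduced to an interval with a strand. The following is a minimal sketch; the `Gene` record, the `relationship` function and all coordinates are hypothetical and chosen only for illustration. "Nested" here means that one transcribed span lies wholly inside another; deciding whether it lies within an intron would additionally require exon coordinates.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Gene:
    name: str
    start: int   # 1-based, inclusive
    end: int     # inclusive, start <= end
    strand: str  # "+" or "-"


def relationship(a: Gene, b: Gene) -> str:
    """Classify how the transcribed spans of two genes relate to each other."""
    if a.end < b.start or b.end < a.start:
        return "separate"
    a_contains_b = a.start <= b.start and b.end <= a.end
    b_contains_a = b.start <= a.start and a.end <= b.end
    kind = "nested" if (a_contains_b or b_contains_a) else "overlapping"
    orientation = "same strand" if a.strand == b.strand else "opposite strands"
    return f"{kind}, {orientation}"


# Hypothetical coordinates, not real map positions.
host = Gene("host", 1_000, 90_000, "+")
guest = Gene("guest", 40_000, 42_000, "-")
neighbour = Gene("neighbour", 89_500, 120_000, "+")

print(relationship(host, guest))      # nested, opposite strands
print(relationship(host, neighbour))  # overlapping, same strand
```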

Some genes such as the immunoglobulin and T-cell receptor genes (see 
Chapter 4, section 4.2.4) differ in structure between different cell types. Which 
is the gene; that which is present in the germline or that which is rearranged to 
perform a specific function in the soma? 

Should distant regulatory sequences (such as LCRs, see section 1.1.2, Locus con- 
trol regions) and protein binding sites that maintain chromosomal conformation be 
included within the boundaries of a gene? Indeed, if we are prepared to entertain 
radical redefinition of what constitutes a gene, should all parts of a gene necessar- 
ily be contiguous on the same DNA strand or even the same chromosome? 

At the time of writing, the majority of human gene transcripts still remain to be 
characterized. However, Adams et al. (1995) attempted to ascribe functions to 
cDNAs by limited DNA sequence analysis and detection of homologies to known 
proteins. These data are summarized in Table 1.1. Of the human cDNAs studied by 
Adams et al. (1995), which probably represent no more than 10% of the total num- 
ber, only eight of the corresponding genes were expressed in all 30 tissues exam- 
ined. Some 227 genes were expressed in >20 tissues whilst some 4300 genes were 
found to be expressed in only one tissue. Clearly such data are extremely useful for 
any conceptual discussion of what we mean by tissue specificity of gene expression. 

Table 1.1. Putative functions of a sample of human genes (after Adams et al., 1995) 

Putative function                     Proportion of transcripts (%) 
Cell signalling                       12 
RNA synthesis/processing              6 
Protein synthesis/processing          15 
Metabolism                            16 
Cell division/DNA synthesis           4 
Cell structure/mobility               8 
Cell/organism defence/homeostasis     12 
Unclassified                          24 

Although the primary control of gene expression is usually exerted at the level 
of transcription, the regulation of gene expression may also occur at several other 
different stages in the pathway including transcriptional activation, mRNA 
splicing, stability, export, translation (synthesis of the protein product), 
post-translational processing, and export of the mature protein (reviewed by 
Atwater et al., 1990; Hentze 1991; Figure 1.2). A variety of sequences within a 
typical gene region are required for normal and appropriate expression to occur. 
These are described briefly in subsequent sections. 


Polymorphisms 


Variation, whatever may be its cause, and however it may be limited, is the 
essential phenomenon of evolution. Variation, in fact, is evolution. The readiest 
way, then, of solving the problem of evolution is to study the facts of variation. 
William Bateson (1894) Materials for the Study of Variation 


The term polymorphism has been defined (Vogel and Motulsky, 1986) as a ‘mendelian 
trait that exists in the population in at least two phenotypes, neither of which occurs 
at a frequency of less than 1%’. Some DNA polymorphisms are neutral single base- 
pair changes detected by virtue of the consequent introduction or removal of a 
restriction enzyme recognition site and are accordingly termed Restriction 
Fragment Length Polymorphisms (RFLPs). They are inherited as simple 
Mendelian traits since two alleles are generated as a consequence of the presence or 
absence of each restriction site. RFLPs are not rare, being distributed throughout 
the genome at a frequency of between 1/200 and 1/1000 bp (Collins et al., 1997; 
Cooper et al., 1985; Li and Sadler, 1991; Wang et al., 1998). Some 25% of single 
nucleotide polymorphisms (SNPs) in higher primates occur in CpG dinucleotides 
(Savatier et al., 1985; Yang et al., 1996), consistent with a model of methylation- 
mediated deamination. 
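
The way in which a single base-pair change generates an RFLP can be illustrated with a short sketch. The two allele sequences below are synthetic, and EcoRI (recognition sequence GAATTC, cutting after the first G) is used purely as a familiar example enzyme; neither is taken from the studies cited above.

```python
ECORI_SITE = "GAATTC"  # EcoRI recognition sequence; cuts between G and AATTC


def fragment_lengths(seq: str, site: str = ECORI_SITE, cut_offset: int = 1) -> list[int]:
    """Fragment lengths after complete digestion of a linear molecule."""
    cuts = []
    pos = seq.find(site)
    while pos != -1:
        cuts.append(pos + cut_offset)
        pos = seq.find(site, pos + 1)
    edges = [0] + cuts + [len(seq)]
    return [right - left for left, right in zip(edges, edges[1:])]


# Synthetic alleles differing by one base (A -> G) inside the recognition site.
allele_1 = "ACGT" * 5 + "GAATTC" + "TTGCA" * 4
allele_2 = allele_1.replace("GAATTC", "GAGTTC")

print(fragment_lengths(allele_1))  # [21, 25]  site present: two fragments
print(fragment_lengths(allele_2))  # [46]      site absent: one fragment
```

The two fragment patterns correspond to the two alleles that segregate as a simple Mendelian trait.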

Not unexpectedly, the vast majority of polymorphisms occur in introns or inter- 
genic regions rather than within coding sequences and may thus be expected to be 
neutral with respect to fitness (Bowcock et al., 1991). Those polymorphisms that 
occur either within coding regions (see Chapter 2, section 2.3.7) or in the promoter 
region (see Chapter 5, section 5.1.9) may however affect either the structure or func- 
tion of the gene product or the expression of the gene and may therefore have the 
potential to be of phenotypic or even pathological significance (Cooper and 
Krawczak, 1993). Those coding sequence polymorphisms that alter the amino acid 
sequence of the encoded protein are found at a lower rate and with lower allele fre- 
quencies than silent substitutions (Cargill et al., 1999). This probably reflects the 
action of negative selection on deleterious alleles during human evolution. 

Some polymorphisms may be missense mutations, for example those underlying 
the Lewis Le alleles in the FUT3 gene (19p13.3; Nishihara et al., 1994) or the 
Arg/Gln 353 polymorphism in the factor VII (F7; 13q34) gene (see Chapter 2, sec- 
tion 2.3.7). Others are nonsense mutations that serve to inactivate the gene in ques- 
tion, for example the secretor se allele in the FUT2 gene (19cen-qter) present in 20% 
of the population (Kelly et al., 1995). Further types of gene-associated polymor- 
phism in the human genome include triplet repeat copy number (see Chapter 8, sec- 
tion 8.9), gene deletions (see Chapter 8, section 8.1), gene duplications (see Chapter 
8, section 8.5), intragenic duplications (see Chapter 8, section 8.6), micro-insertions 
(see Chapter 8, section 8.3), inversions (see Chapter 9, section 9.1), gene fusion (see 
Chapter 9, section 9.3), and gene copy number (see Chapter 8, sections 8.1 and 8.5). 
Various databases of human DNA polymorphisms are available for online consulta- 
tion: The Genome Database (http://gdbwww.gdb.org/), the Database of Single 
Nucleotide Polymorphisms (http://www.ncbi.nlm.nih.gov/SNP/), and the database 
of Human Genic Bi-Allelic Sequences (http://hgbase.interactiva.de/intro.html). 

The mechanisms by which polymorphisms are maintained in human popula- 
tions are likely to be varied. The neutralist theory assumes no selection on the 
alleles of a polymorphic locus and the frequency of an allele may therefore 
increase simply by genetic drift (the change of allele frequency due to random 
sampling). Such ‘transient polymorphisms’ often remain at a low frequency in 
the population before being lost or may instead increase in frequency under the 
influence of either genetic drift or positive selection until one allele reaches fix- 
ation. Most known polymorphisms are probably of this type. However, if the 
alternative alleles are not neutral with respect to fitness (see Chapter 2, section 
2.3.7), the DNA polymorphisms may be maintained by selection pressure, 
possibly overdominant selection (‘balanced polymorphisms’). Finally, a ‘hitchhiker 
effect’ may operate if the polymorphisms are closely linked to a locus which is 
itself under strong selection. In some special cases, such as polymorphisms 
within CpG dinucleotides, recurrent mutation may serve to maintain the allele 
frequency in the population. 
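
The fate of a neutral allele under genetic drift can be illustrated with a minimal simulation. The sketch below assumes an idealized Wright-Fisher population (constant size, random mating, non-overlapping generations); this model is a simplification introduced here for illustration, and the function name and parameters are hypothetical.

```python
import random


def wright_fisher(p0: float, n_e: int, generations: int, seed: int = 0) -> list[float]:
    """Allele-frequency trajectory of a neutral allele in a diploid population.

    Each generation, 2 * n_e gene copies are drawn at random from the previous
    generation. The run stops early once the allele is lost (0.0) or fixed (1.0).
    """
    rng = random.Random(seed)
    copies = 2 * n_e
    p = p0
    trajectory = [p]
    for _ in range(generations):
        count = sum(1 for _ in range(copies) if rng.random() < p)
        p = count / copies
        trajectory.append(p)
        if p in (0.0, 1.0):
            break
    return trajectory


# A rare allele in a small population; most such runs end in loss of the allele.
run = wright_fisher(p0=0.05, n_e=50, generations=500)
print(f"generations simulated: {len(run) - 1}, final frequency: {run[-1]:.2f}")
```

Repeating the run with different seeds shows the behaviour described above: most new alleles remain rare and are lost, while a minority drift to fixation.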

The alleles of closely linked polymorphisms often occur in specific combinations 
or haplotypes. In the β-globin (HBB; 11p15) gene, for example, some haplotypes 
have been identified in Europeans, Asians, Blacks, and Chinese, indicating that 
their origin must have predated racial divergence (Antonarakis et al., 1985). By 
contrast, other haplotypes are population-specific or ‘private’ (e.g. Schneider et 
al., 1998; Wainscoat et al., 1986). In some cases, the number of different haplo- 
types can be quite high, the result of the shuffling of different single base-pair 
substitutions by such processes as recombination and gene conversion (Fullerton 
et al., 1994). 

In their analysis of 8 kb of intronic sequence of the Duchenne muscular dys- 
trophy (DMD; Xp21) genes of individuals from 13 different human populations 
[European, Papua New Guinean, African (6), Asian (3) and Amerindian (2)], 
Zietkiewicz et al. (1997) identified 36 polymorphisms. Of these, 15 were shared 
among most of the populations screened, 13 were confined to Africans and four 
were confined to non-Africans. A detailed study of human DNA polymorphism 
has also been performed on a 9734 bp region of the human lipoprotein lipase 
(LPL; 8p22) gene, comprising 8736 bp intronic sequence and 998 bp coding 
sequence, in 142 chromosomes from 71 individuals from three distinct 
populations: Finns, European-Americans and African-Americans 
(Nickerson et al., 1998). A total of 88 sites were found to vary (79 single base-pair 
substitutions and nine microdeletions or microinsertions) representing an aver- 
age of one variable site every 500 bp. Only seven of the 88 variable sites were 
found in the coding region. At 34 of the 88 sites, the variation was found in only 
one of the three populations reflecting the differing population and mutational 
histories. A total of 88 unique haplotypes were identified in the 142 chromosomes 
sampled, probably a reflection of the complex historical interplay of recombina- 
tion and mutation. Finally, a study of a total of 87 kb of human chromosome Xq22 
revealed 102 polymorphisms, seven of which were shared by Europeans, 
Ashkenazim, and pygmies, two by pygmies and Europeans, and 19 by 
Ashkenazim and Europeans (Anagnostopoulos et al., 1999). 
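
The three surveys above can be put on a common footing by dividing the length of sequence screened by the number of variable sites found. The sketch below uses only the figures quoted in the text; the function name is illustrative. The raw ratio counts every site found to vary anywhere in the sample, so it depends on how many chromosomes were screened and is not the same quantity as the average difference between two chromosomes.

```python
def bp_per_variable_site(region_bp: int, variable_sites: int) -> float:
    """Average spacing, in bp, between variable sites in a surveyed region."""
    return region_bp / variable_sites


# (length screened in bp, number of variable sites), as quoted in the text.
surveys = {
    "DMD intronic sequence (Zietkiewicz et al., 1997)": (8_000, 36),
    "LPL (Nickerson et al., 1998)": (9_734, 88),
    "Xq22 (Anagnostopoulos et al., 1999)": (87_000, 102),
}

for label, (length, sites) in surveys.items():
    spacing = bp_per_variable_site(length, sites)
    print(f"{label}: one variable site per {spacing:.0f} bp")
```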

Some polymorphisms within human gene coding regions also occur in the 
orthologous genes of the great apes. In some cases, this may imply an origin for 
the polymorphism that predated the adaptive radiation of the higher primates 
(‘trans-species polymorphism’). Perhaps the best example of this phenomenon is 
provided by the ABO blood group locus (ABO; 9q34.1-q34.2). In humans, there 
are four amino acid substitutions (in positions 176, 235, 266 and 268) that distin- 
guish the A- and B-transferases (respectively R, G, L, and G in A-transferase and 
G, S, M, and A in B-transferase) whilst the O allele is characterized by a single 
nucleotide (G261) frameshift deletion that inactivates the protein (Yamamoto et al., 
1990). The gorilla and chimpanzee possess types B and A respectively whilst the 
baboon (Papio cynocephalus) exhibits both A and B (Yamamoto et al., 1990; 
Kominato et al., 1992; Martinko et al., 1993). Certain alleles of the major histo- 
compatibility complex (MHC) are also very ancient. Indeed, it would appear that 
some of the class I MHC allelic lineages are shared by human, chimpanzee and 
gorilla, implying that these polymorphisms were present in the common ancestor 
of the three species (see Chapter 4, section 4.2.1, Genes of the major histocompatibil- 
ity complex). Another possible example of trans-species polymorphism is to be 
found in the complement C4 genes (C4A, C4B; 6p21.3) of humans, chimpanzees 
and orangutans (Kawaguchi et al., 1990; Kawaguchi and Klein 1992; Paz-Artal et 
al., 1994). Finally, trans-species polymorphism may also be present in the cone 
visual pigment genes, RCP and GCP (Xq28; Deeb et al., 1994) but in this case, the 
evidence must be regarded as more equivocal. For polymorphisms to have sur- 
vived the speciation process, they must have resisted fixation even during severe 
population bottlenecks. Neutral alleles would have been unlikely to survive the 
process of speciation and so the long-term persistence of polymorphisms must be 
explicable by selection. In the case of MHC and ABO alleles, allelic lineages have 
predated human speciation and pathogen-driven overdominant selection (heterozy- 
gote advantage) has probably been the factor that has served to maintain the pres- 
ence of alternative alleles (see Chapter 4, section 4.2.1, Genes of the major 
histocompatibility complex). In other cases, it is conceivable that identical mutations 
could have occurred independently in the different lineages thereby mimicking 
true trans-species polymorphism. 

Trans-specific polymorphism is likely to be the exception rather than the rule 
since, in the absence of positive selection, polymorphisms are most unlikely to 
survive speciation. Thus an average of 4N_e generations (N_e = effective 
population size) are required for a newly emerged allele to become fixed in a 
population as a consequence of random drift. Since until recently the effective 
population size of modern humans has been ~10 000 individuals, most neutral 
human polymorphisms may be expected to be less than 800 000 years old (assuming 
a 20 year generation time), rather less than the 5-6 Myrs that have elapsed since 
the divergence 
of the human and chimpanzee lineages. Hacia et al. (1999) studied the orthologous 
sequences from chimpanzees (both species) and gorilla for a total of 397 human 
single nucleotide polymorphisms (SNPs). They were able to determine the alleles 
in 214 of these SNP sites and of these, three segregated for the same nucleotides 
in both humans and pygmy chimpanzees whilst two segregated in humans and 
gorillas. However, 4/5 of the shared polymorphic sites occurred at hypermutable 
CpG dinucleotides suggesting that recurrent mutation rather than identity-by- 
descent was the cause of the shared polymorphism. 
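
The age limit quoted above follows directly from the 4N_e expectation. A minimal worked example, using only the values given in the text (N_e of ~10 000, a 20 year generation time, and the lower bound of 5 Myrs for the human-chimpanzee divergence); the function name is illustrative.

```python
def expected_fixation_time_years(n_e: int, generation_years: float) -> float:
    """Mean time to fixation by drift of a new neutral allele: 4 * N_e generations."""
    return 4 * n_e * generation_years


fixation_years = expected_fixation_time_years(n_e=10_000, generation_years=20)
divergence_years = 5_000_000  # lower bound quoted for human-chimpanzee divergence

print(fixation_years)                     # 800000
print(fixation_years / divergence_years)  # 0.16
```

On these figures the expected persistence of a neutral polymorphism is less than one fifth of the time since the lineages diverged, which is why shared neutral polymorphisms are expected to be rare.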


Functional organization of human genes. Well before the organization of the 
human genome is known in its entirety, some trends governing the distribution of 
genes have become apparent (McKusick, 1986; Schinzel et al., 1993; Strachan, 
1992) and these are discussed briefly below. 

Genes which encode the same product (e.g. ribosomal RNA (RNR), histones (H1F2, 
H1F3, H1F4, H2A, H2B, H3F2, H4F2), HLA, homeobox proteins (HOXA, 
HOXB, HOXC, and HOXD), immunoglobulins (IGK, IGL, IGH)) are often 
clustered. However, these clusters are usually distributed between several different 
chromosomes (e.g. RNR, chromosomes 13, 14, 15, 21, and 22; histones, chromo- 
somes 1, 6, and 12; HOX, chromosomes 2, 7, 12, and 17). Thus, multigene families 
have evolved by duplication and divergence but the duplicated copies may no 
longer be syntenic (i.e. located on the same chromosome) owing to the 
translocation of discrete chromosomal regions during evolution. The likelihood of 
synteny may well be related to the time that has elapsed since duplication. 

Genes which encode tissue-specific protein isoforms or isoenzymes are sometimes 
clustered (e.g. pancreatic (AMY2A, AMY2B) and salivary (AMY1A) amylase genes on 
chromosome 1p21) but sometimes not (e.g. cardiac (ACTC), skeletal muscle 
(ACTA1), smooth muscle, aorta (ACTA2) and smooth muscle, enteric (ACTG2) 
α-actin genes which are located on chromosomes 15, 1, 10, and 2, respectively). 
Again, the time that has elapsed since the duplication event is likely to be an 
important factor in determining whether or not the duplicated genes have 
remained syntenic. 

Genes encoding isozymes specific for different subcellular compartments are usually not 
syntenic [e.g. soluble/extracellular and mitochondrial forms of superoxide dismu- 
tase (SOD1/SOD3, SOD2 on chromosomes 21/4 and 6, respectively), aconitase 
(ACO1, ACO2 on chromosomes 9 and 22) and thymidine kinase (TK1, TK2 on 
chromosomes 17 and 16)]. 

Genes encoding enzymes catalyzing successive steps in a particular metabolic pathway are 
usually not syntenic. Thus, the five enzymes of the urea cycle are encoded by genes 
(ARG1, ASL, ASS, CPS1, OTC) on chromosomes 6, 7, 9, 2, and X and four 
enzymes involved in galactose metabolism are encoded by genes (GALE, GALK1, 
GALK2, GALT) on chromosomes 1, 17, 15, and 9. However, there are exceptions to 
this rule: four genes encoding enzymes of the glycolytic pathway (TPI, GAPD, 
ENO2, LDHB) are located on the short arm of chromosome 12 in the region p13- 
p12. Similarly, the GDH and PGD genes encoding enzymes of the phosphoglu- 
conate pathway are encoded by linked genes on chromosome 1. The reasons for this 
syntenic organization, when it occurs, are usually unclear but we may nevertheless 
surmise the evolutionary history of the genes involved. Indeed, as early as 1945, 
Horowitz proposed that genes encoding enzymes of metabolic pathways could have 
arisen by serial duplication. His idea was that the protein that would eventually 
become the terminal enzyme of a given metabolic pathway would possess a binding 
site for the substrate that it used. A novel protein derived from the duplicated gene 
could still use this binding site for interaction with the same substrate molecule but 
could evolve the capability of producing it as a product from another source. Thus, 
the substrate for the terminal enzyme would become an intermediate in the devel- 
oping metabolic pathway. In this way, the various enzymes of a pathway would 
evolve from each other in the reverse order to that in which they appear in the mod- 
ern pathway. The synteny observed in the cases cited above might be a consequence 
of conservation resulting from a requirement for coordinate regulation of the loci 
concerned. Some of these principles also appear to hold for the genes encoding the 
enzymes of the coagulation cascade (see Chapter 10). 

Genes encoding different subunits of a heteromeric protein are often not syntenic (e.g. the 
genes encoding α-globin (HBA1, HBA2, chromosome 16) and β-globin (HBB, 
chromosome 11), lactate dehydrogenases A (LDHA, chromosome 11) and B 
(LDHB, chromosome 12), factor XIII subunits a (F13A, chromosome 6) and b 
(F13B, chromosome 1), the immunoglobulin light chains (IGK, IGL, 
chromosomes 2 and 22), and heavy chains (IGH, chromosome 14)). However, several 
cases of synteny are known, e.g. the genes encoding the three chains of fibrinogen 
(FGA, FGB, FGG, all closely linked on chromosome 4), the α- and β-chains of 
C4b-binding protein (C4BPA and C4BPB, closely linked on chromosome 1q32), the 
complement component 1Q α- and β-chains (C1QA, C1QB, linked on chromosome 
1p) and the platelet membrane glycoproteins IIb and IIIa (ITGA2B, ITGB3, 
closely linked on chromosome 17q21-q22). The clustering of these genes may be 
important for their coordinate regulation by common control elements (see 
Chapter 5, section 5.1.14). The various subunits of the T-cell antigen receptor are 
intriguing: the α- and δ-subunits are encoded by genes (TCRA, TCRD) on 
chromosome 14, the β- and γ-subunit genes (TCRB, TCRG) are on chromosome 7 
whereas the ε-subunit gene (TCRE) lies on chromosome 11. Whilst synteny 
probably implies a common evolutionary origin, lack of synteny does not necessarily 
argue against it. 

Clustering of genes of similar function and common evolutionary origin is common, for 
example the genes encoding blood coagulation factors VII (F7) and X (F10) on 
chromosome 13q34, the γ-crystallin (CRYG) gene cluster on chromosome 2q33 
and the six alcohol dehydrogenase (ADH1, ADH2, ADH3, ADH4, ADH5, 
ADH7) genes on chromosome 4q22. Many more examples are given in Chapter 4. 

Genes do not usually exhibit chromosomal clustering with respect to the structure/func- 
tion of particular organs or subcellular organelles (e.g. mitochondria). However, vari- 
ous genes encoding proteins expressed in the course of epidermal differentiation 
[involucrin (IVL), loricrin (LOR), filaggrin (FLG), the small proline-rich 
proteins (SPRR1A, SPRR1B, SPRR2A, and SPRR3), trichohyalin (THH)] are 
clustered together on chromosome 1q21 thereby betraying their common 
evolutionary origin (Volz et al., 1993). A considerable number of the genes 
encoding various cytokines (including several hematopoietic growth factors) and 
their receptors are clustered on the long arm of chromosome 5: 
granulocyte-macrophage colony-stimulating factor (GMCSF), macrophage 
colony-stimulating factor (CSF2), the CSF1 receptor (CSF1R, colony-stimulating 
factor-1 receptor, also known as c-fms), interleukins 3, 4, 5, 9, 12B, and 13 (IL3, 
IL4, IL5, IL9, IL12B, IL13), platelet-derived growth factor receptor-β (PDGFRB), 
acidic fibroblast growth factor (FGF1) and fibroblast growth factor receptor 4 
(FGFR4). The genes encoding the IL3 receptor α-chain (IL3RA) and the GMCSF 
receptor α-chain (CSF2RA) both map to the pseudoautosomal region of the sex 
chromosomes. Synteny betrays the common evolutionary origin of the genes as well 
as the probable mechanism: tandem duplication. 

Genes encoding ligands and their associated receptors are sometimes syntenic, for 
example the genes encoding transferrin (TF) and its receptor (TFRC) are both 
located on chromosome 3q whilst the genes encoding apolipoprotein E (APOE) 
and the low density lipoprotein receptor (LDLR) are both located on chromo- 
some 19. However, not surprisingly, this is far from always the case, for example 
insulin (INS, chromosome 11) and insulin receptor (INSR, chromosome 19); 
epidermal growth factor (EGF, chromosome 4), epidermal growth factor receptor 
(EGFR, chromosome 7); growth hormone (GH1, GH2, chromosome 17) and 
growth hormone receptor (GHR, chromosome 5); interferons α, β, γ and ω 
(IFNA, IFNB1, IFNG, IFNW1, chromosomes 9, 8 and 9, 12, 9), interferon 
receptors α/β/ω and γ (IFNAR1, IFNGR1, chromosome 6). 

The linear order of members of a family of related genes can reflect the order in which 
they become activated during development, for example the HBE1 (embryonic), 
HBG2, HBG1 (fetal), HBD, HBB (postnatal) genes of the human β-globin 
cluster. The expression of these genes is controlled by an upstream locus control 
region (LCR; see section 1.1.2, Locus control regions) and correct gene order is 
required for the normal temporal pattern of developmental expression 
(Hanscombe et al., 1991). Another example of this phenomenon is provided by the 
HOX genes which are organized chromosomally according to their order of 
expression (van der Hoeven et al., 1996). Intriguingly, the encoded Hox proteins 
vary in their affinity for their target DNA sequences and these affinities also cor- 
relate with the linear order of the genes (Pellerin et al., 1994). By contrast, the 
order of the human ADH genes on chromosome 4q21, 5’-ADH3-ADH2-ADH1-3’, 
is opposite to their order of transcriptional activation in hepatic development 
(Yasunami et al., 1990) although the significance of this observation, if any, is 
unknown. The reflection of temporal order of expression in the physical order on 
the chromosome is not however a universal phenomenon as is evidenced by the 
human and murine myosin heavy chain (MYH) gene clusters on chromosomes 17 
and 11, respectively (Weiss et al., 1999). Finally, it may be significant that the 
order of four human mucin genes (MUC2, MUC5AC, MUC5B, MUC6) on 
chromosome 11p15 corresponds to their order in terms of the anterior-posterior axis 
of the epithelial areas where they are preferentially expressed (Pigny et al., 1996). 


Pseudogenes. Pseudogenes are DNA sequences which are closely related to func- 
tional genes but which are incapable of encoding a protein product on account of 
the presence of deletions, insertions and nonsense mutations which abolish the 
reading frame or otherwise prevent gene expression (reviewed by Wilde, 1985; see 
Chapter 6 for an in-depth treatment). Some human pseudogenes are transcribed 
(e.g. Bristow et al., 1993; Nguyen et al., 1991; Takahashi et al., 1992) but these tran- 
scripts are not translated. There are two major types of pseudogene: the first arises 
through the duplication and subsequent inactivation of a gene (see Chapter 6, 
section 6.1.1). This type of pseudogene retains the exon/intron organization of the 
parental gene and is often closely linked to it. Examples include 
the pseudogenes in the α- and β-globin clusters (e.g. Cheng et al., 1988). The 
second type of pseudogene contains only the exons of the parental gene, usually 
possesses a poly(A) tail at the 3’ end and is dispersed randomly in the genome (see 
Chapter 6, section 6.1.2). These processed genes are thought to have originated as 
mRNAs which have then become integrated into the genome by retrotransposition 
(i.e. the reverse transcription of the mRNA and the integration of the resulting 
cDNA). 
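
The kinds of lesion that disable a pseudogene (deletions or insertions that shift the reading frame, and nonsense mutations) all tend to introduce a premature in-frame stop codon. The sketch below illustrates this on a short synthetic open reading frame; the sequences and the function name are invented for illustration and do not correspond to any real gene.

```python
from typing import Optional

STOP_CODONS = {"TAA", "TAG", "TGA"}


def first_stop_codon(seq: str) -> Optional[int]:
    """0-based position of the first in-frame stop codon (frame 0), or None."""
    for i in range(0, len(seq) - 2, 3):
        if seq[i:i + 3] in STOP_CODONS:
            return i
    return None


# Synthetic "functional" reading frame: six sense codons, then a stop codon.
functional = "".join(["ATG", "GCT", "GAA", "CGT", "TTG", "AAA", "TAA"])

# Single-nucleotide deletion (position 4) shifts the reading frame.
frameshifted = functional[:4] + functional[5:]

# Nonsense mutation: the third codon GAA becomes TAA.
nonsense = functional[:6] + "TAA" + functional[9:]

print(first_stop_codon(functional))    # 18 (the normal stop codon)
print(first_stop_codon(frameshifted))  # 12 (premature stop in the new frame)
print(first_stop_codon(nonsense))      # 6  (premature stop at the mutated codon)
```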

Pseudogenes are relatively common in the human genome (McAlpine et al., 
1993; many hundreds are known) and may be especially prevalent in multigene 
families (e.g. β-globin, actin, HLA, interferons, snRNAs, keratins, T-cell 
receptors, immunoglobulin gene clusters; see Chapter 6). However, single copy genes 
may also have multiple pseudogenes (e.g. prohibitin, PHB, four pseudogenes; 
argininosuccinate synthetase, ASS, 14 pseudogenes). 


Promoter elements. The archetypal gene contains promoter elements upstream 
(5’) of the transcriptional initiation site (i.e. the beginning of the mRNA) which 
serve to specify the temporal and spatial pattern of expression of the downstream 
gene and define its potential for induction by external stimuli (Figure 1.3). Some 
genes contain multiple alternative promoters which are utilized in a tissue-specific 
fashion (e.g. DMD; Nishio et al., 1994; Figure 1.4; reviewed by Ayoubi and van de 
Ven, 1996). Some promoters may be located within an intron (e.g. within the first 
intron of the human ADA gene; Aronow et al., 1989). Some promoters may even be 
found buried within the introns of other genes (e.g. an element of the human 
CYP21 gene promoter lies within intron 35 of the C4A gene; Tee et al., 1995). 

Figure 1.3. Transcriptional control elements upstream of the transcriptional initiation site 
in the human heat shock protein 1 (HSPA1A; 6p21.3) and metallothionein 2A (MT2A; 
16q13) genes. The TATA, SP1, and CCAAT boxes bind transcription factors that are 
involved in constitutive transcription whilst the glucocorticoid response element (GRE), 
metal response element (MRE), heat shock element (HSE), and the AP1 and AP2 sites 
bind factors involved in the induction of gene expression in response to specific stimuli. 

Figure 1.4. Alternative promoter usage in the human dystrophin (DMD) gene (redrawn 
from Strachan and Read 1996). The alternative promoters are: C, cortical; M, muscle; P, 
Purkinje; R, retinal; CNS, central nervous system; S, Schwann cell; G, glial cell. The 79 
exons of the DMD gene are denoted by bars. The first exon used to encode each isoform 
is given the suffix 1, that is S1 denotes the first exon incorporated as a result of Schwann 
cell promoter usage. Dystrophin isoforms are denoted by Dp acronyms. 

The transcriptional initiation site is usually preceded by constitutive promoter 
elements of defined sequence, for example the TATA box (TATAAA; 25-30 bp 5’ to 
the cap site), the initiator element (Py Py A+1 N T/A Py Py), which serves as a 
functional analogue of the TATA box in TATA-less promoters (Lo and Smale 1996), 
and the CCAAT motif (~90 bp 5’ to the cap site), which potentiate a basal level of gene 
expression. Further upstream regulatory motifs bind tissue-specific transcription 
factors, some binding different factors in different tissues. Response elements, as the 
name suggests, are able to confer transcriptional responsiveness to various external 
trigger stimuli such as temperature, hormones and growth factors. In this way, gene 
expression is rendered responsive to both the internal (cellular) and external (to the 
organism) environment. Together, these upstream DNA sequences are involved in 
controlling gene regulation and induction and conferring tissue-specificity of 
expression (reviewed by Mitchell and Tjian, 1989). 
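
The positions quoted above for the constitutive promoter elements can be illustrated by scanning an upstream sequence for each motif and reporting its distance from the cap site. The sketch below uses a synthetic 120 bp upstream region, constructed so that the motifs sit at the positions given in the text; the sequence and the function name are hypothetical, and real promoters match these consensus motifs only approximately.

```python
import re


def find_motif(upstream: str, motif: str) -> list[int]:
    """Start positions of a motif relative to the cap site.

    `upstream` is the sequence immediately 5' of the cap site, so its last
    base is position -1 and its first base is position -len(upstream).
    A lookahead is used so that overlapping matches are also reported.
    """
    n = len(upstream)
    return [m.start() - n for m in re.finditer(f"(?={re.escape(motif)})", upstream)]


# Synthetic upstream region: CCAAT placed at -90 and TATAAA at -30.
upstream = "G" * 30 + "CCAAT" + "G" * 55 + "TATAAA" + "G" * 24

print(find_motif(upstream, "CCAAT"))   # [-90]
print(find_motif(upstream, "TATAAA"))  # [-30]
```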

A large number of DNA sequence motifs have now been identified within gene 
promoters which represent binding sites for DNA-binding proteins (Johnson and 
McKnight, 1989; Kel et al., 1995; see section 1.1.2, Trans-acting protein factors). 
These protein-DNA interactions are required to confer appropriate regulation 
upon the genes bearing them (reviewed by Clark and Doherty, 1993; Faisst and 
Meyer, 1992; Freemont et al., 1991; Latchman, 1990). Removal of these DNA 
sequence motifs abolishes the gene’s specific pattern of spatial and/or temporal 
expression. 


Enhancers. Enhancers are DNA sequences that are present 5’ or 3’ to a gene (or 
within an exon or an intron) and which are capable of activating the transcription 
of the gene in a tissue-specific fashion, independently of their orientation and dis- 
tance from the correct initiation site (Müller et al., 1988; Wasylyk, 1988). Collins 
et al. (1998) reported the unusual case of the mouse thrombospondin 3 (Thbs3) 
enhancer that is located far upstream of the Thbs3 gene, within intron 6 of the
divergently transcribed metaxin (Mtx) gene. 

Enhancers function by acting as templates for the assembly of multiprotein 
complexes on the gene promoter. Thompson and McKnight (1992) likened 
enhancer-protein interactions to a three-dimensional jigsaw puzzle: 


‘the arrangement of regulatory motifs forms the puzzle template. Specific reg- 
ulatory proteins, by forming contours of appropriate fit for both DNA tem- 
plate and neighbouring proteins, constitute the puzzle pieces.’ 


Mechanistic models involving the looping out of DNA between the enhancer and 
the transcriptional initiation complex have been proposed in order to explain how 
enhancers manage to influence the activity of their target promoters at consider- 
able distances (Ptashne, 1988). 


Negative regulatory elements. The negative regulation of transcription by 
repressors or silencers is now so well documented (reviewed by Clark and Doherty, 
1993; Herschbach and Johnson, 1993; Jackson, 1991; Levine and Manley, 1989) 
that it is likely that many if not most genes are subject to their inhibitory influ- 
ence. Indeed, every gene promoter region is likely to possess its own unique com- 
bination of positive and negative regulatory elements which serves to determine 
its temporal and spatial pattern of expression. These elements permit the binding 
of a specific set of DNA-binding proteins which are thereby brought into suffi- 
ciently close proximity so as to allow their interaction both with each other and 
with the RNA polymerase in order to influence transcription either positively or 
negatively. Negative regulation therefore serves to prevent the expression of a
gene in an inappropriate tissue, at an inappropriate time or at an inappropriate
level. Negative regulatory elements also potentiate the down-regulation of a
gene’s expression following its transient induction. Such elements have been
found not merely in the promoter regions of human genes, as in the α1-antitrypsin
(PI; 14q32.1) gene (De Simone and Cortese, 1989), but also in the first exon
[osteocalcin (BGLAP; 1q25-q31; Li et al., 1995) gene], the first intron [CD4
(12pter-p12; Donda et al., 1996) gene], or even the third intron [type IV collagen
(COL4A2; 13q34; Haniel et al., 1995) gene] of a gene.


Locus control regions. Regulatory elements that exert their effects on the expression
of downstream genes over great distances have been described in the human α-globin
(HBA1; 16p13.3), β-globin (HBB; 11p15), growth hormone (GH1;
17q22-q24) and red/green cone pigment (RCP, GCP; Xq28) genes among others 
(Hanscombe et al., 1991; Jarman et al., 1991; Jones et al., 1995; Wang et al., 1992; 
Figure 6.2 in Chapter 6). These are known as locus control regions (LCRs). The LCR 
located 40 kb upstream of the HBB gene is essential for the high level, tissue-spe- 
cific expression of the HBB gene. It is believed to contain an enhancer capable of 
controlling the replication timing of the β-globin gene cluster, organizing it into an
active chromatin domain and directing the expression of downstream sequences in 
an erythroid-specific and developmental stage-specific fashion (Higgs, 1998; 
Orkin, 1995). Although the LCR is conserved among mammals, little homology is
apparent between mammalian and avian β-globin LCRs (Hardison, 1998).


Trans-acting protein factors. As we have seen, the transcriptional activation of 
eukaryotic genes is made possible by the interaction of trans-acting protein factors 
with cis-acting DNA sequence motifs including enhancers. These transcription 
factors typically contain a sequence-specific DNA-binding domain, a multimer- 
ization domain which allows the formation of either homomultimers or hetero- 
multimers, and a transcriptional activation domain. These domains can be 
combined in modular fashion to generate an array of different transcription fac- 
tors (Latchman, 1998; Tjian and Maniatis, 1994). 

It is the cis-acting DNA sequences within a gene promoter that allow transcrip- 
tion factors to be brought into close proximity so that they may either interact 
with each other in the transcriptional initiation complex or combine together in 
an enhancer complex. No one enhancer-binding protein can act on its own; rather,
it must act in concert with other enhancer-binding proteins. For example, one 
factor may induce a bend in the DNA thereby promoting the interaction of two 
already bound proteins with each other. Once the enhancer complex is assembled, 
it must be able to interact either directly or indirectly with the basal transcription 
apparatus via its activation domain (Ptashne, 1988; Ptashne and Gann, 1990; 
Pugh and Tjian, 1990). 

Transcription factors can be grouped into families of related proteins whose 
relatedness extends to homology in their DNA-binding domains and therefore an 
ability to bind to related DNA sequences (reviewed by Pabo and Sauer, 1992). 
Specific transcription factors can thus bind to more than one DNA sequence. 
Conversely, a single DNA sequence motif may sometimes be bound by more than 
one transcription factor. DNA-binding domains fall into one of four main groups 
defined by homologous amino acid sequences that give rise to a particular struc- 
ture capable of binding DNA: homeodomain, zinc finger, leucine zipper and 
helix-loop-helix (Pabo and Sauer, 1992). These domains usually bind the negatively
charged DNA molecule through basic amino acid residues. Activation domains of
transcription factors also come in four types: rich in either glutamine, proline, 
serine/threonine, or acidic amino acids. Examples of these types are SP1, 
CTF/NF-1, Pit-1, and GATA-1, respectively (reviewed by Ptashne, 1988; Ptashne 
and Gann, 1990). 

Transcriptional repressor proteins may bind directly to DNA or exert their 
repressive effects indirectly by interacting with basal transcription factors, tran- 
scriptional activators and co-repressor proteins to inhibit transcription by RNA 
polymerases (Hanna-Rose and Hansen, 1996). 

The study of DNA-protein interactions has been enormously facilitated by the 
use of two techniques: gel retardation analysis (also termed band or mobility shift 
assays; Dent and Latchman, 1993) and DNase I footprinting (Lakin, 1993). The 
former technique is extremely useful for searching for DNA-binding proteins in 
crude nuclear extracts whereas the latter method provides information as to the 
precise location of the binding site on the DNA sequence under study. 


5’ and 3’ untranslated regions. The presence of the 5’ untranslated region (5’
UTR; the sequence lying between the transcriptional initiation site and the 
translational start codon, ATG; Figure 1.1) is often essential for the normal 
expression of a gene. Sequences in the 5’ UTRs of various genes are thought to 
play a role in controlling the translation of the encoded mRNA (reviewed by 
Curtis et al., 1995; Melefors and Hentze, 1993; Pesole et al., 1997; Sachs, 1993). 
Perhaps the best characterized posttranscriptional control mechanism involving 
the 5’ UTR is that of the iron response element (IRE). The IRE is found in the 
5’ UTRs of several human genes [e.g. ferritin (FTH1), transferrin (TF), erythroid
5-aminolevulinate synthase (ALAS2); Cox and Adrian, 1993; Bhasker et al.,
1993] and is capable of adopting a stem-loop structure that interacts with a
cytosolic RNA-binding protein thereby inhibiting mRNA translation. The
post-transcriptional regulation of several other human genes [e.g. transforming
growth factor-β1 (TGFB1; Kim et al., 1992b) and basic fibroblast growth factor
(FGF2; Prats et al., 1992)] also appears to involve regulatory elements that
modulate the efficiency of translation.

Zhang (1998) examined the nature of translational start codons in human 
genes. The relative frequencies of start codons occurring as the first, second, third 
or fourth ATG codons in the reading frame were 474, 51, 5, and 0, respectively. 
There are, however, quite a few examples of more than one ATG being used in the 
same gene. Thus, the alternative use of two ATG codons 84 bp apart in the human 
peroxisome proliferator-activated receptor (PPARG; 3p25) gene serves to gener- 
ate two distinct protein isoforms that differ in length by 28 amino acids at the 
amino terminal end of the protein (Elbrecht et al., 1996). The alternative use of 
different ATG initiation codons has also been reported for the human von
Hippel-Lindau (VHL; 3p25-p26) gene which leads to the production of two distinct
protein products that differ in length by 53 amino acids (Blankenship et al., 1999).
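
The figures quoted above lend themselves to a short worked calculation. The sketch below converts Zhang's (1998) counts into proportions and checks that two in-frame ATG codons 84 bp apart correspond to a difference of 28 residues. The counts and the 84 bp spacing are from the text; the function and variable names are illustrative assumptions.

```python
# Counts reported by Zhang (1998) for the ATG used as the start codon,
# keyed by its rank among the ATG codons in the reading frame.
ZHANG_START_CODON_COUNTS = {1: 474, 2: 51, 3: 5, 4: 0}


def start_codon_shares(counts: dict[int, int]) -> dict[int, float]:
    """Convert raw counts into the proportion of genes using each ATG."""
    total = sum(counts.values())
    return {rank: n / total for rank, n in counts.items()}


def residues_between_atgs(spacing_bp: int) -> int:
    """Extra residues encoded when the upstream of two in-frame ATGs is used."""
    if spacing_bp % 3 != 0:
        raise ValueError("the two ATG codons are not in the same reading frame")
    return spacing_bp // 3


shares = start_codon_shares(ZHANG_START_CODON_COUNTS)
print({rank: round(share, 3) for rank, share in shares.items()})
print(residues_between_atgs(84))  # PPARG: two ATGs 84 bp apart
```

On these counts, roughly nine in ten of the genes surveyed initiate translation at the first ATG.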

Zhang (1998) reported that the ratio of the frequencies of stop codons TAA, 
TAG and TGA in human genes was about 1:1:2. The 3’ untranslated region is 
that region of a gene which lies downstream of the stop codon. The AATAAA 
sequence located downstream of the stop codon (TAA in Figure 1.1) of ~90% of
genes (i.e. in the 3’ untranslated region) controls cleavage/polyadenylation
(addition of a poly A tail 10-30 bp downstream) which is vital for mRNA stability. The
AAUAAA motif is thought to interact with at least two RNA-binding proteins, 
cleavage-polyadenylation specificity factor (CPSF) and cleavage-stimulatory fac- 
tor (CstF), which are thought to promote assembly of the polyadenylation com- 
plex (Manley and Proudfoot 1994). Some genes possess multiple polyadenylation 
sites which can be used alternatively (Edwalds-Gilbert et al., 1997). 
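
As a rough illustration of the spacing described above, the following sketch locates each AATAAA hexamer in a 3’ untranslated region and reports the interval 10-30 bases further downstream in which cleavage would be expected. The hexamer and the 10-30 bp spacing come from the text; the function name, the exact index convention and the toy sequence are assumptions, and no attempt is made to model the downstream GU-rich or upstream U-rich elements.

```python
POLYA_SIGNAL = "AATAAA"  # AAUAAA in the mRNA


def polya_cleavage_windows(utr3: str, min_gap: int = 10, max_gap: int = 30):
    """Return (signal_start, window_start, window_end) for each signal found.

    All values are 0-based indices. With signal_end the index just past the
    hexamer, the window is the half-open interval
    [signal_end + min_gap, signal_end + max_gap), truncated at the end of
    the sequence. Signals with no room for a window are skipped.
    """
    seq = utr3.upper().replace("U", "T")
    windows = []
    start = seq.find(POLYA_SIGNAL)
    while start != -1:
        signal_end = start + len(POLYA_SIGNAL)
        window_start = signal_end + min_gap
        window_end = min(signal_end + max_gap, len(seq))
        if window_start < window_end:
            windows.append((start, window_start, window_end))
        start = seq.find(POLYA_SIGNAL, start + 1)
    return windows


toy_utr = "C" * 20 + "AAUAAA" + "C" * 40
print(polya_cleavage_windows(toy_utr))
```

Because some genes carry several polyadenylation sites, the function returns a list rather than a single window.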

Other sequences have been implicated in determining mRNA stability, for 
example a GU-rich motif (consensus sequence, YGUGUUYY), present 20-40 
bases downstream of the cleavage site in ~70% of mammalian pre-mRNAs, that 
plays a crucial role in 3’ end formation (Decker and Parker, 1994; Manley and 
Proudfoot, 1994; Ross, 1996; Sachs, 1993). A further U-rich element, upstream of
the AAUAAA polyadenylation motif, is also involved. Some short-lived mRNAs
possess AU-rich regions which specify instability (Chen and Shyu, 1995). Still 
other sequences may be involved in mRNA export (Izaurralde and Mattaj, 1995),
nucleocytoplasmic transfer (Colgan and Manley, 1997), intracellular localization 
(St Johnston, 1995) and in promoting translational efficiency (Curtis et al., 1995). 
A wide variety of RNA-binding proteins are now known which play an important 
role in the post-transcriptional regulation of gene expression (reviewed by Burd 
and Dreyfuss, 1994; Siomi and Dreyfuss, 1997). 


Boundary elements. The position and orientation independence of enhancers 
and LCRs clearly raises the potential problem of the inappropriate activation of 
promoters from neighboring genes. Sequences which constrain the activity of 
enhancers were originally reported in mice and Drosophila (Kellum and Schedl, 
1991; reviewed by Eissenberg and Elgin, 1991). Termed ‘boundary elements’ or 
‘insulators,’ these sequences serve to insulate a gene from the effects of either 
enhancer or suppressor elements emanating from the surrounding chromatin. 
This insulator function appears to work either by blocking interactions with 
other sequences past the boundary element and/or by limiting the influence of an 
enhancer to the locality of its target gene(s) (Geyer, 1997). Boundary elements 
therefore serve as functional barriers when inserted between an enhancer and its 
downstream reporter gene. However, bracketing an enhancer-gene combination
with boundary elements serves to confine the enhancer activity to that gene’s pro- 
moter such that its expression is maintained or even increased. MARs (see section 
1.1.1, Matrix attachment regions) may act as boundary elements as evidenced by 
their ability to establish transcriptional domains around transgenes and act as 
buffers to shield these transgenes from position effects (McKnight et al., 1992). 

A boundary element has been described 5’ to the LCR of the chicken β-globin
gene cluster (hypersensitive site 4); it serves to insulate genes 5’ to it from the
influence of the LCR, a function which it also manifests in both human and
Drosophila cells (Chung et al., 1993). DNA around DNase I hypersensitive site
5 of the human β-globin gene appears to function in a similar fashion (Li and
Stamatoyannopoulos, 1994).

Another possible example of a boundary element in humans has been noted 
upstream of the coagulation factor X (F10; 13q34) gene (Miao et al., 1992). This
gene lies 2.8 kb downstream of the gene (F7) encoding the homologous clotting
protein, factor VII. Three positive regulatory elements (FXP3, FXP2, and FXP1)
have been characterized upstream of the F10 gene. FXP1 and FXP2 act in an ori- 
entation- and position-independent fashion whilst FXP1 and FXP3 are responsi- 
ble for directing liver-specific gene expression. The putative boundary element 
just upstream of the FXP3 sequence is thought to prevent transcriptional activa- 
tion of the F7 gene by the F10 gene enhancers. 


Sequence motifs involved in mRNA splicing and processing. One of the 
characteristics of eukaryotic genes that distinguishes them from their prokary- 
otic counterparts is the production of large pre-mRNAs which contain interven- 
ing non-coding sequences (introns) that are removed by a highly accurate 
cleavage/ligation reaction known as splicing before the mRNA is transported to 
the cytoplasm for translation (reviewed by Green, 1986; Padgett et al., 1986). 
Splicing not only permits the removal of introns from the primary transcript but 
also allows the generation of different mRNAs from the same gene by alternative 
splicing, an important mechanism for tissue-specific or developmental regulation 
of gene expression and a very economical means of generating biological diver- 
sity (Nadal-Ginard et al., 1987; Norton, 1994). Alternative splicing may be regu- 
lated by variation in the intracellular levels of antagonistic splicing factors 
(Caceres et al., 1994). 

The splicing of a eukaryotic mRNA appears to occur as a two-stage process. In 
the case of a simple two exon gene, the pre-mRNA is first cleaved at the 5’ (donor) 
splice site to generate two splicing intermediates, an exon-containing RNA 
species and a lariat RNA species containing the second exon plus intervening 
intron. Cleavage at the 3’ (acceptor) splice site and ligation of the exons then 
occurs resulting in the excision of the intervening intron in the form of a lariat. 
Splicing efficiency is critically dependent upon the accuracy of cleavage and 
rejoining. This accuracy appears to be determined, at least in part, by the virtually 
invariant GT and AG dinucleotides present at the 5’ and 3’ exon/intron junctions 
respectively. However, more extensive consensus sequences spanning the 5’ and 
3’ splice junctions are evident (Mount, 1982; Padgett et al., 1986) and the coding 
sequence flanking intron junctions exhibits some degree of conservation (Long et 
al., 1998). More recently, Zhang (1998) has proposed AG | GTRAGT as a consensus
sequence for donor splice sites and (Y)n NCAG | G (where n has a mean of
nine) as a consensus for acceptor splice sites. Stephens and Schneider (1992) 
noted the similarity between the donor and acceptor sites and suggested that these 
junctions may have been derived from a common ‘proto-splice site’ ancestor 
(Figure 1.5). During evolution, the emphasis of the sequence information at each 
site has shifted to the intronic side of the junction (Figure 1.5). 
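
To make the donor consensus concrete, the sketch below counts how many positions of a candidate eight-base site depart from AG | GTRAGT, interpreting R with the standard IUPAC code (A or G). The consensus is taken from the text; the function names and test sequences are illustrative assumptions, and a mismatch count is only a crude stand-in for the frequency-weighted information shown in a sequence logo.

```python
# IUPAC nucleotide codes needed for the consensus strings quoted in the text.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "N": "ACGT",
}

DONOR_CONSENSUS = "AGGTRAGT"  # AG | GTRAGT; the exon ends after the first AG


def consensus_mismatches(site: str, consensus: str = DONOR_CONSENSUS) -> int:
    """Count the positions at which `site` departs from the consensus."""
    site = site.upper().replace("U", "T")
    if len(site) != len(consensus):
        raise ValueError("site and consensus must have the same length")
    return sum(base not in IUPAC[code] for base, code in zip(site, consensus))


def has_invariant_gt(site: str) -> bool:
    """True if the first two intronic bases (positions 3-4 of the site) are GT."""
    return site.upper().replace("U", "T")[2:4] == "GT"


for candidate in ("AGGTAAGT", "AGGTGAGT", "CGGTAAGA", "AGGAAAGT"):
    print(candidate, consensus_mismatches(candidate), has_invariant_gt(candidate))
```

A site such as CGGTAAGA retains the GT dinucleotide while departing from the wider consensus at two positions, which is the kind of variation the extended consensus is meant to capture.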

A few human nuclear genes possess introns with noncanonical terminal dinucleotide
sequences (e.g. ‘AT-AC introns’; Tarn and Steitz, 1997); these include the
transcription factor E2F1 (20q11.2) gene, the cartilage matrix protein (MATN1;
1p25) gene, the Hermansky-Pudlak syndrome (HPS; 10q23) gene and the paralogous
sodium channel α-subunit genes (SCN4A, 17q23.1-q25.3; SCN5A, 3p24-p21;
SCN8A, 12q13). The removal of AT-AC introns requires two low-abundance
snRNAs, U4atac and U6atac, not found in the major spliceosome (Tarn and Steitz,
1996). There is now good evidence for the existence of two distinct splicing systems
in eukaryotes which, having evolved in separate lineages, came together in a
eukaryotic progenitor (Dietrich et al., 1997; Wu and Krainer, 1997; Burge et al., 1998).
Some variant splice sites are evolutionarily conserved, for example the highly
unusual GA donor splice site present in exon 10 of the Xenopus and human
fibroblast growth factor receptor 1 (FGFR1; 8p11) genes as well as in the paralogous
FGFR2 (10q26) and FGFR3 (4p16) genes of mouse and human (Twigg et al., 1998).

Figure 1.5. Putative schema for the evolution of donor and acceptor splice sites
from a single ‘proto-splice site’ ancestor (from Stephens and Schneider, 1992).
Sequence logos for proto-, donor and acceptor splice sites are shown with the
heights of the individual letters being proportional to their frequencies at each
position. On the donor side, the proto-splice site ‘cag’ has lost emphasis to become
a smaller ‘cag’ whilst the ‘gt’ has become a more strongly conserved ‘GT’. On the
acceptor side, the proto-splice site ‘cag’ has gained emphasis to become a ‘CAG’
whilst the ‘gt’ has become even smaller.

A further conserved sequence element, the ‘branch-point’, has been identified 
in the introns of eukaryotic genes and, when it occurs, is usually located some 
18-40 bp upstream of the 3’ splice site (Green, 1986). Although this sequence 
appears to play a role in forming a branch with the 5’ terminus of the intron, it 
exhibits a rather weak consensus sequence (Y N Y T R A Y; Krainer
and Maniatis, 1988). The sequence UACUAAC appears to be the most efficient 
branch site for mammalian mRNA splicing both in vitro and in vivo (Zhuang et al., 
1989). The branch-point sequences in human can best be described by the consensus
Y U V A Y for loci with low G+C content and C U G/C A Y for loci with
high G+C content (Zhang, 1998). Whereas both the length and location of the 
pyrimidine tract may be important determinants of branch-point and acceptor 
splice site utilization, the 3’ acceptor splice site itself appears to possess little 
specificity and may serve merely as the first AG dinucleotide downstream of the 
branch-point/pyrimidine tract. 
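
The idea that the acceptor site may simply be the first AG downstream of the branch-point can be sketched as follows. The two branch-point consensus sequences are those attributed to Zhang (1998) in the text, written with T for U and with V read as the IUPAC code for A, C or G; the function name, the choice of the first match as the branch site and the toy intron are assumptions, and the pyrimidine tract is deliberately ignored.

```python
import re

# Branch-point consensus sequences quoted in the text, written with T for U.
BRANCH_PATTERNS = {
    "low_gc": re.compile("[CT]T[ACG]A[CT]"),  # Y U V A Y
    "high_gc": re.compile("CT[GC]A[CT]"),     # C U G/C A Y
}


def predict_acceptor(intron_3prime: str, gc_class: str = "low_gc"):
    """Return (branch_a_index, ag_index), or None if either is missing.

    The first consensus match is taken as the branch site and the first AG
    3' of that match as the acceptor. Indices are 0-based.
    """
    seq = intron_3prime.upper().replace("U", "T")
    match = BRANCH_PATTERNS[gc_class].search(seq)
    if match is None:
        return None
    branch_a = match.start() + 3  # the A within the five-base consensus
    ag = seq.find("AG", match.end())
    if ag == -1:
        return None
    return branch_a, ag


# Toy 3' end of an intron: branch site, pyrimidine tract, then CAG | GT.
toy_intron = "GGG" + "CTGAC" + "T" * 10 + "CCC" + "T" * 6 + "CAG" + "GT"
print(predict_acceptor(toy_intron))
```

In the toy example the predicted AG lies 22 bases downstream of the branch-point adenosine, within the 18-40 bp range mentioned above.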

A further category of sequences required for alternative splicing are the splicing 
enhancers (Lopez, 1998). Several types of splicing enhancer have been recognized. 
These sequences occur in either exons or introns and serve to promote the use of 
neighboring weak 5’ or 3’ splice sites. They do this through the binding of
serine-arginine (SR) proteins, a family of modular splicing factors which contain
one or more RNA-binding domains.

Splicing occurs within the spliceosome, a complex assembly of small nuclear
ribonucleoprotein particles (snRNPs) composed of a variety of snRNAs and associated
proteins (reviewed by Lührmann et al., 1990). The pre-mRNA is folded in such a way
that splice sites are optimally aligned for cleavage and ligation. In this process, the 
snRNAs play a vital role. Our current understanding of mRNA splicing suggests 
that the formation of the 5’ splice site complex is contingent upon the prior for- 
mation of the 3’ splice site complex (Smith et al., 1993). Spliceosomal structure 
and function has been conserved in eukaryotes from yeast to humans (Wentz- 
Hunter and Potashkin, 1995). 


Polycistronic and polyprotein genes. In most cases, human genes adhere to the 
principle of ‘one gene, one polypeptide.’ However, some multi-chain protein 
genes encode more than one polypeptide. In such cases, synthesis of the mature 
protein involves the post-translational cleavage of a polypeptide precursor
followed by association of the constituent chains. Thus, the human insulin (INS;
11p15) gene encodes an A chain, a B chain, and a connecting peptide which is
thought to be important in maintaining the conformation of the two chains. More
often than not, however, multichain proteins are encoded by different genes, for
example the α- and β-globins. Nevertheless, even though the
microtubule-associated proteins 1A and 1B (MAP1A, 15q13-qter; MAP1B, 5q13) are encoded by
different genes, both genes encode heavy and light chains that must be
proteolytically processed from a precursor polypeptide (Hammarback et al., 1991; Fink et
al., 1996). 

Several human genes are polycistronic in that they encode different peptides 
with different functions, for example calcitonin/calcitonin gene-related peptide α
(CALCA; 11p15; Allison et al., 1981), pancreatic polypeptide/pancreatic
icosapeptide (PPY; 17p11-qter; Boel et al., 1984) and vasoactive intestinal
polypeptide/PHM27 (VIP; 6q26-q27; Bodner et al., 1985; Tsukada et al., 1985). A
further putative polycistronic gene has been characterized in both mouse and
human: the growth/differentiation factor-1 (GDF1) gene is cotranscribed with a
gene of unknown function (Uog1), the two being separated by a 269 bp
intercistronic region (Lee, 1991). Finally, Reiss et al. (1998) described the polycistronic
human molybdopterin synthase (MOCOD; 6p21.3) gene which contains two
overlapping open reading frames (ORFs) encoding distinct polypeptides required
for the synthesis of molybdenum cofactor. The two polypeptides are encoded by 
the same 3.2 kb mRNA, the first by exons 1-9, the second by exon 10. Stallmeyer 
et al. (1999) have demonstrated that each ORF is translated independently. 

More dramatic is the case of the human glycinamide ribonucleotide synthetase 
(GART; 21q22.1) gene which generates an mRNA encoding a trifunctional pro- 
tein with three distinct enzymatic activities (glycinamide ribonucleotide syn- 
thetase, glycinamide ribonucleotide formyltransferase and aminoimidazole 
ribonucleotide synthetase, which catalyse respectively the second, third and fifth
enzymatic steps in the de novo synthesis of purines) (Brodsky et al., 1997). These 
activities are separate in bacteria and partly separate in yeast implying that 
ancient recombination events may have fused the coding regions together
(Davidson and Peterson, 1997). Fusion could have occurred as a means to allow
coordinate regulation of the three enzymatic activities. 

One way in which polycistronic mRNAs might have arisen during evolution is 
by inefficient RNA polymerase termination of transcription of one mRNA species 
allowing the cotranscription of a second mRNA species from a downstream gene. 
Such a phenomenon has been shown to occur in normal human cells and involves 
the fusion splicing of two adjacent (9p13) genes, those encoding
galactose-1-phosphate uridylyltransferase (GALT) and interleukin-11 receptor α-chain (IL11RA)
(Magrangeas et al., 1998). 

Some human genes encode polyproteins and comprise multiple tandem repeats 
of the same coding region on the same transcript, for example those encoding the 
ubiquitins (UBA52, 19p13; UBB, 17p11-p12; UBC, 12q24; Wiborg et al., 1985) 
and filaggrin (FLG; 1q21; Gan et al., 1990; McKinley-Grant et al., 1989). 


1.1.3 DNA methylation


5-methylcytosine (5mC) is the most common form of DNA modification in 
eukaryotic genomes. Soon after DNA synthesis is complete, target cytosines are 
modified by a DNA methyltransferase using S-adenosylmethionine as methyl 
donor. In humans, between 70% and 90% of 5mC occurs in CpG dinucleotides, the 
majority of which appear to be methylated (Cooper, 1983). 

Whereas the vertebrate genome is heavily methylated, methylation is virtually 
undetectable in insects and other arthropods (Cooper, 1983). An intermediate 
level of methylation is exhibited by the echinoderms, coelenterates and molluscs 
whose genomes are characterized by the presence of long methylated and 
unmethylated tracts (Cooper, 1983). The transition from a fractional to a global 
methylation pattern appears to have occurred close to the origin of the verte- 
brates since the cephalochordate Amphioxus exhibits a typically invertebrate pat- 
tern of genome methylation whereas the jawless vertebrates hagfish and lamprey 
possess a vertebrate pattern (Tweedie et al., 1997). The increase in size of the 
methylated compartment characteristic of vertebrate genomes correlates with 
the sharp increase in gene number during the invertebrate-vertebrate transition 
(Bird, 1995). DNA methylation may thus have been recruited as a transcriptional 
regulatory mechanism (Colot and Rossignol, 1999). 

DNA methylation is essential for normal mammalian development (Razin and 
Shemer, 1995). It is thought to play a role in both gene regulation (Kass et al., 
1997) and imprinting (Jaenisch, 1997; see section 1.1.3, Imprinting and imprinted 
genes), may serve as a cue for strand specificity in DNA replication and repair 
(Hare and Taylor 1988) and could conceivably serve as a self-defence mechanism 
to silence transposable elements and proviral DNAs integrated into the genome 
during evolution (Yoder et al., 1997; Simmen et al., 1999). This post-synthetic 
modification occurs almost exclusively in CpG dinucleotides of which between 
60% and 90% are methylated in mammalian tissues (Jost and Bruhat, 1997). 

Although it is as yet unclear how tissue-specific methylation patterns are estab- 
lished (Bestor and Tycko 1996; Turker and Bestor, 1997), they are nevertheless 
heritable and reproducible after transmission through the germline (Pfeifer et al., 
1990; Silva and White, 1988). The establishment of cell type-specific methylation
patterns in both the soma and the germline begins with global methylation of
non-CpG island sequences in the embryo whilst the final methylation patterns are
determined by a specific and highly regulated process of demethylation (Razin 
and Shemer, 1995). 

In humans, differences in site-specific methylation patterns between different 
ethnic groups have not so far been apparent in studies of a variety of different 
types of DNA sequence (Behn-Krappa et al., 1991; Bottema et al., 1990; Kochanek 
et al., 1990). In keeping with these reports, Millar et al. (1998) found no evidence 
in sperm DNA for variable methylation status in the factor VIII (F8C; Xq28) 
gene between European Caucasians and Asians. However, the methylation status 
of specific CpG sites in the F8C gene did exhibit significant inter-individual vari- 
ation (Millar et al., 1998). 


CpG islands. Spatially, the distribution of CpG appears to be nonrandom in 
the human genome; about 1% of the genome consists of stretches very rich in 
CpG which together account for roughly 15% of all CpG dinucleotides (Bird, 
1986). In contrast to most of the scattered CpG dinucleotides, these CpG islands 
represent unmethylated domains and comprise ~50% of all unmethylated 
CpGs in the genome (Bird et al., 1985). CpG islands occur, on average, every 
100 kilobases in the murine genome (Brown and Bird, 1986) and, in human, are 
often located immediately 5’ to gene coding regions (Gardiner et al., 1990; 
Larsen et al., 1992; Figure 1.6). The methylation of CpG islands may modulate 
gene expression through its influence on chromatin organization (Kundu and 
Rao, 1999). 
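
CpG-rich stretches of this kind are usually recognized computationally by comparing the observed number of CpG dinucleotides with the number expected from base composition. The sketch below uses the widely applied ratio (number of CpG × sequence length) / (number of C × number of G); this formula is not given in the text and is introduced here as an assumption, as are the function name and the toy sequences. Any threshold for calling an island is left to the reader.

```python
def cpg_stats(seq: str) -> tuple[float, float]:
    """Return (G+C fraction, observed/expected CpG ratio) for a sequence.

    Expected CpG is taken as (count of C * count of G) / length, so that
    observed/expected = (count of CpG * length) / (count of C * count of G).
    """
    s = seq.upper()
    length = len(s)
    c_count, g_count = s.count("C"), s.count("G")
    cpg_count = s.count("CG")  # CG cannot overlap itself, so this is exact
    gc_fraction = (c_count + g_count) / length if length else 0.0
    if c_count == 0 or g_count == 0:
        return gc_fraction, 0.0
    return gc_fraction, (cpg_count * length) / (c_count * g_count)


print(cpg_stats("CGCGCGCG"))  # CpG-rich toy sequence
print(cpg_stats("CAGTCAGT"))  # same bases available, but no CpG
```

Bulk vertebrate DNA, being CpG-depleted, would give a ratio well below 1 by this measure, whereas an island would give a value much closer to 1.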

Not all vertebrate genes, however, possess CpG islands (Bird, 1986; 
Gardiner-Garden and Frommer, 1987; Figure 1.6) and many are partially or 
even heavily methylated. In general, gene promoters containing CpG islands 
are unmethylated regardless of expression whereas promoters lacking CpG
islands tend to lose their methylation upon transcription. CpG islands
sometimes also occur within the coding regions of genes as in the case of the human
apolipoprotein E (APOE; 19q13) gene in which exon 4 constitutes a 940 bp
CpG island. Other examples of human genes with CpG islands within their
coding regions are the CDKN2A (9p21), MYOD1 (11p15.4), and PAX6 (11p13)
genes.

Figure 1.6. The distribution of CpG within genes. (a) A typical CpG
island-containing gene as found in many ‘housekeeping’ genes. (b) A typical
tissue-specific gene lacking a CpG island.

CpG islands are predominantly found in early replicating (R band) as opposed 
to late replicating (G band) regions of the genome (Craig and Bickmore, 1994). 
They may be considered to be the last remnants of the unmethylated domains 
that once dominated the vertebrate genome. The evolution of the heavily methy- 
lated vertebrate genome has been accompanied by a progressive loss of CpG din- 
ucleotides as a direct consequence of their methylation in the germline. Although 
CpG islands are usually unmethylated and therefore relatively immune to muta- 
tional decay (Luoh et al., 1995; see section 1.1.3, CpG suppression in the vertebrate 
genome and its origin), there is nevertheless some evidence for their gradual erosion 
over evolutionary time (Matsuo et al., 1993). 


CpG suppression in the vertebrate genome and its origin. Methylation of 
cytosine results in a high level of mutation due to the propensity of 5mC to 
undergo deamination to form thymine. Deamination of 5mC probably does not
occur during the enzymatic replication of the methylation pattern which appears 
to be a high fidelity process. Indeed, 5mC deamination probably occurs with the 
same frequency as the deamination of cytosine to uracil. However, whereas uracil 
DNA glycosylase activity in eukaryotic cells is able to recognize and excise uracil, 
thymine being a ‘normal’ DNA base is thought to be both less readily detectable 
and removable by cellular DNA repair mechanisms. 

One consequence of the hypermutability of 5mC is the paucity of CpG in the 
genomes of many eukaryotes (Setlow, 1976), the heavily methylated vertebrate 
genomes exhibiting the most extreme ‘CpG suppression’ (Bird, 1980; Jabbari et 
al., 1997; Schorderet and Gartler, 1992). A first estimate of the in vivo rate at 
which 5mC is deaminated and fixed as thymine was arrived at by extrapolation
from in vitro data (Cooper and Krawczak, 1989). To this end, the deamination rate 
of 5mC as measured under laboratory conditions in single stranded DNA, was 
modified so as to be consistent with the observed spectrum of point mutations 
found to have caused human genetic disease. The rate estimate of
$1.66 \times 10^{-16}\ \mathrm{s}^{-1}$ was consistent with the rate calculated from the evolutionary pattern of
CpG substitution exhibited by B-globin gene and pseudogene sequences in 
human, chimpanzee and macaque (Cooper and Krawczak, 1989). Mathematical 
modelling allowed a detailed study of the dynamics underlying the CpG sup- 
pression currently found in the bulk DNA of vertebrate genomes (Cooper and 
Krawczak, 1989). It was inferred that the time span required in order to create 
the currently observed CpG frequency (0.01) was 50 to 100 Myrs. On the other 
hand, the process of CpG loss must have lasted for approximately 450 Myrs in 
order for the mononucleotide frequencies to have attained their present levels. 
This time span corresponds closely to the estimated time since the emergence 
and adaptive radiation of the vertebrates and thus coincides with the probable 
advent of heavily methylated genomes. These data are therefore consistent with 
patterns of vertebrate gene methylation having been comparatively stable over
relatively long periods of evolutionary time. 
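
The order of magnitude of these time spans can be checked with a crude first-order decay calculation. The Python sketch below is illustrative only. It takes the quoted rate as 1.66 x 10^-16 per second (the exponent is an assumption where the printed value is unclear), treats every site as fully methylated, and ignores back-mutation, selection and the complementary strand. It is therefore not the mathematical model of Cooper and Krawczak (1989), and the function names are hypothetical.

```python
import math

SECONDS_PER_YEAR = 365.25 * 24 * 3600  # Julian year

# Assumed per-site rate at which 5mC is deaminated and fixed (per second).
RATE_PER_SECOND = 1.66e-16


def fraction_remaining(years, rate_per_second=RATE_PER_SECOND):
    """Fraction of methylated sites still intact after `years` of first-order decay."""
    return math.exp(-rate_per_second * SECONDS_PER_YEAR * years)


def half_life_years(rate_per_second=RATE_PER_SECOND):
    """Time needed for half of the sites to be lost under the same assumptions."""
    return math.log(2) / (rate_per_second * SECONDS_PER_YEAR)


if __name__ == "__main__":
    print(f"half-life: {half_life_years() / 1e6:.0f} Myr")
    for myr in (50, 100, 450):
        print(f"{myr} Myr: {fraction_remaining(myr * 1e6):.2f} of sites remain")
```

Under these simplifying assumptions the half-life of a methylated site is somewhat over 100 Myrs, which is the same order of magnitude as the time spans discussed above.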


The CpG dinucleotide as a mutation hotspot. CpG has been found to be a 
hotspot for mutation in a wide range of different human genes (reviewed by 
Cooper and Krawczak, 1993). Of the single base-pair substitutions so far 
reported to cause human genetic disease (Krawczak et al., 1998), ~30% involve a 
CpG dinucleotide, and ~79% (~23% of the total) of these are either C→T or
G→A transitions. These data imply that the rate at which CpG mutates to either
TpG or CpA is some five-fold higher than the basal mutation rate per nucleotide. 
Methylation-mediated deamination of 5mC therefore represents a major cause of 
gene mutability leading to human genetic disease. 
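
The arithmetic behind these percentages, and the test of whether a given substitution is compatible with methylation-mediated deamination, can be sketched in Python. The function name and the 0-based coordinate convention are assumptions made for illustration. The sequence is read on one strand only, so deamination of 5mC on the opposite strand appears as a G to A change.

```python
def is_cpg_deamination(seq, pos, alt):
    """Return True if replacing seq[pos] by `alt` is a CG->TG or CG->CA change.

    `seq` is the reference sequence (one strand, 5'->3'); `pos` is 0-based.
    """
    seq = seq.upper()
    alt = alt.upper()
    ref = seq[pos]
    # C->T where the C is followed by G (deamination on this strand)
    if ref == "C" and alt == "T":
        return pos + 1 < len(seq) and seq[pos + 1] == "G"
    # G->A where the G is preceded by C (deamination on the other strand)
    if ref == "G" and alt == "A":
        return pos > 0 and seq[pos - 1] == "C"
    return False


# Share of all substitutions that are CpG transitions, from the figures quoted above.
share_in_cpg = 0.30
share_transitions_within_cpg = 0.79
share_of_total = share_in_cpg * share_transitions_within_cpg  # 0.30 x 0.79 = 0.237

if __name__ == "__main__":
    print(is_cpg_deamination("AACGTT", 2, "T"))
    print(is_cpg_deamination("AACGTT", 3, "A"))
    print(is_cpg_deamination("AACATT", 2, "T"))
    print(round(share_of_total, 3))
```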

The proportion of human gene mutations compatible with a model of
methylation-mediated deamination (CG→TG, CA) is however very much an average
figure and provides no information on individual genes. When these values are
recalculated on a gene-specific basis, considerable variation becomes apparent.
The frequency of CG→TG and CG→CA mutations may be smaller than 10%
(HBB (4%), HPRT1 (5%), TTR (9%)) or larger than 50% (MYH7 (67%), ADA
(63%), RB1 (57%)). In the assumed absence of a detection bias (Cooper and
Krawczak, 1993), we may surmise that this variation is due either to differences in 
(a) germline DNA methylation and/or (b) relative intragenic CpG frequency 
(itself dependent upon (a)). Imbalances in the propensity to transition at CpG 
dinucleotides exist even between the two DNA strands. For the human hypoxan- 
thine phosphoribosyltransferase (HPRT1) gene, Skandalis et al. (1994) have 
noted a significant strand bias in mutations recovered at CpG sites, thereby con- 
firming our own findings (Cooper and Krawczak, 1993; Krawczak et al., 1998). 

CpG hypermutability in inherited disease implies that the CpG sites in ques- 
tion are methylated in the germline thereby rendering them prone to 5mC deam- 
ination. However, on its own, CpG hypermutability still represents very indirect 
evidence for CpG methylation. That 5mC deamination is itself directly responsi- 
ble for these mutational events is evidenced by the fact that several cytosine 
residues known to have undergone a germline mutation in the low density 
lipoprotein receptor (LDLR; hypercholesterolemia), p53 (TP53; various types of
tumor), factor VIII (F8C; hemophilia A), and neurofibromatosis type 1 (NF1)
genes are indeed methylated in sperm (Rideout et al., 1990; Andrews et al., 1996; 
Millar et al., 1998; El-Maarri et al., 1998). As yet, however, these are the only stud- 
ies to have attempted to correlate CpG hypermutability with DNA methylation 
directly for specific CpG dinucleotides. 


Imprinting and imprinted genes. DNA methylation is thought to be involved 
in imprinting which was defined by Monk (1988) as the ‘differential modification 
of the maternal and paternal contributions to the zygote, resulting in the differ- 
ential expression of parental alleles during development and in the adult.’ This 
differential modification appears to be essential for normal mammalian development
since uniparental embryos (whether diploid paternal or diploid maternal)
do not survive to term: in diploid maternal embryos, fetal development is
normal but development of the extraembryonic membranes is abnormal. In 
diploid paternal embryos, it is the other way around (reviewed by Monk, 1988).
Clearly, maternal and paternal chromosomes must differ epigenetically and in 
such a way that different developmental programmes are followed. 

Only some 100-200 genes in mammalian genomes are imprinted and these are 
often clustered (Barlow, 1995). Human examples include the INS, H19 
(D11S878E) and IGF2 (11p15.5), SNRPN (15q11-q12), WT1 (11p13), IGF2R
(6q25-q27), and XIST (Xq13.2) genes. The XIST gene is essential for X-inactiva- 
tion (a special case of imprinting) and is expressed exclusively from the inactive X 
chromosome (Jamieson et al., 1996; Kay, 1998). It gives rise to a non-coding RNA
product that controls the production of an inactivation signal which spreads along 
the chromosome, silencing all but a handful of genes (Heard et al., 1997). 

A common function of imprinted genes is in the control of embryonic growth 
with paternally expressed genes (e.g. IGF2) tending to enhance growth rates and
maternally expressed genes (e.g. H19, IGF2R) reducing them. This dichotomy 
has led to the proposal of the genetic conflict hypothesis (Haig and Trivers 1995) 
which attempts to explain the evolution of imprinting in terms of a conflict of 
interest between the maternal and paternal genes of an individual. This theory 
might predict antagonistic coevolution between maternally and paternally 
derived genes for the control of fetal growth. However, contrary to the expecta- 
tions of the conflict hypothesis, the rate of evolution of imprinted genes is not sig- 
nificantly different from that exhibited by non-imprinted genes encoding 
receptors (McVean and Hurst, 1997). 

It has been suggested that the imprinted expression of genes is usually con- 
served between human and mouse (Barlow, 1995) but whilst the maternal-specific 
expression of the Igf2r gene is seen in mouse (Barlow et al., 1991), imprinting of
the IGF2R gene is only apparent in a minority of the human population (Xu et al., 
1993; Ogawa et al., 1993). Such an ‘imprinting polymorphism’ has also been noted 
for the WT1 gene in human populations (Nishiwaki et al., 1997).

Imprinted genes may have fewer and smaller introns than nonimprinted genes 
(Hurst et al., 1996; Haig, 1996). Whether this is an adaptation to allow these genes 
to be transcribed rapidly or whether it is merely a property of the chromosomal 
region in question is however unknown. 

It is at present unclear if imprinting is confined to mammals. Parent-of-origin- 
specific effects have been claimed in the zebrafish, Danio rerio (McGowan and 
Martin, 1997). If confirmed, this would be reconcilable with the observed survival 
of androgenetic zebrafish only if the expression of the imprinted genes were not 
required for development. 


1.1.4 Repetitive sequence elements 


There is repetition everywhere, and nothing is found only once in the world. 
J. W. v. Goethe


Repetitive DNA comprises the bulk (>90%) of the human genome. A large num- 
ber of different types of repetitive sequence element are found in the human 
genome (reviewed by Jelinek and Schmid, 1982; Vogt, 1990) and their analysis 
may go some way toward explaining the patterns of chromosome bands noted 
after the staining of metaphase chromosomes. Three main categories are now
recognized: (i) a highly repetitive class including sequence families with >10⁵ copies
per haploid genome, (ii) a middle repetitive sequence class (10²-10⁵ copies), and
(iii) a low repetitive class whose members possess between two and 100 copies per 
haploid genome. The different classes of highly repetitive satellite DNA (tandem 
repeats) are briefly reviewed here together with Alu repeats and LINE (L1) ele- 
ments, the two most abundant and best characterized of the interspersed middle 
repetitive sequence families in the human genome. 


Tandem repeats. Tandemly repetitive DNA comprises satellite DNA, minisatel- 
lite DNA and microsatellite DNA. Satellite DNA comprises the majority of hete- 
rochromatin and is clustered in tandem arrays of up to several megabases (Mb) in 
length. A number of different families [e.g. simple sequence (5-25 bp repeats), 
alphoid (169 bp and 172 bp repeat), Sau3A (~68 bp repeat)] have been identified 
(reviewed by Vogt, 1990) which are largely confined to the centromeres. One func- 
tion of satellite DNA could be to maintain regions of late replication thereby 
ensuring that the centromere is the last region to replicate on a chromosome 
(Csink and Henikoff, 1998). 

The hypervariable minisatellite sequences (about 10⁴ copies/genome) share a
core consensus sequence [GGTGGGCAGARG] which is reminiscent of the 
Escherichia coli Chi element known to be a signal for generalized recombination 
(Jeffreys, 1987). These minisatellites exhibit substantial copy number variability 
in terms of the number of constituent repeat units and are often telomeric in 
location (see section 1.1.1, Telomeres).
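
The core consensus is written in IUPAC nucleotide code, in which R stands for a purine (A or G). As an illustration, and not the procedure used in the studies cited, such a consensus can be turned into a regular expression and searched for in a sequence. The helper names and the example sequence in this Python sketch are invented.

```python
import re

# IUPAC nucleotide ambiguity codes
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]", "B": "[CGT]", "D": "[AGT]",
    "H": "[ACT]", "V": "[ACG]", "N": "[ACGT]",
}

MINISATELLITE_CORE = "GGTGGGCAGARG"


def consensus_to_regex(consensus):
    """Translate an IUPAC consensus string into a regular expression string."""
    return "".join(IUPAC[base] for base in consensus.upper())


def find_core_matches(seq, consensus=MINISATELLITE_CORE):
    """Return the 0-based start positions of all matches, overlapping ones included."""
    # Wrapping the pattern in a lookahead lets overlapping occurrences be reported.
    lookahead = re.compile(f"(?={consensus_to_regex(consensus)})")
    return [m.start() for m in lookahead.finditer(seq.upper())]


if __name__ == "__main__":
    example = "TTGGTGGGCAGAAGCCGGTGGGCAGAGGAA"  # made-up sequence
    print(find_core_matches(example))
```

The same translation table would serve for any other consensus written in IUPAC code.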

Microsatellite DNA families are simple sequence repeats, the most common 
being (A)n/(T)n, (CA)n/(TG)n and (CT)n/(AG)n types (Beckmann and Weber, 1992;
Vogt et al., 1990). Minisatellites and microsatellites account for between 0.2% and 
0.5% of the human genome, respectively and are widely scattered on many chro- 
mosomes. Their high copy number variability and association with a considerable 
number of different genes has meant that they provide a very valuable source of 
highly informative markers for disease analysis and diagnosis (Bruford and 
Wayne, 1993). Since their presence at specific homologous chromosomal locations 
is often evolutionarily conserved in primates (Coote and Bruford, 1996; Crouau- 
Roy et al., 1996; Morin et al., 1998), they may also be useful in both population 
genetic and evolutionary studies. 

More recently, a further type of tandemly reiterated DNA sequence has been 
described (Gondo et al., 1998). The 4746 bp RS447 repeat sequence, quaintly 
named ‘megasatellite’ DNA, is repeated between 50 and 70 times on chromosome 
4p15. It is evolutionarily conserved among mammals, polymorphic in humans 
and since it contains an open reading frame, it may encode a protein product. 


Alu sequences and other SINEs. The Alu family of short interspersed repeated 
elements (SINEs) is present in all primates. Up to 900 000 copies are thought to 
exist in the human genome (some 5% of the total DNA complement) with an aver- 
age spacing of 4 kb (Hwu et al., 1986). Most occur in noncoding DNA but some 
are known to be located in untranslated regions (Makalowski et al., 1994) or even 
coding regions (Margalit et al., 1994). 

Alu repeats share a recognizable consensus sequence but the extent of homology
to this consensus varies from 72% to 99% (Kariya et al., 1987; Batzer et al.,
1990). Human Alu sequences are ~300 bp in length, are polyadenylated and con- 
sist of two related sequences each between 120 and 150 bp long separated by an A- 
rich region (Figure 1.7). Alu sequences appear to be degenerate forms of 7SL RNA 
(RN7SL) that have been reverse transcribed and integrated into the genome (Ullu 
and Tschudi, 1984). As to a possible function for Alu sequences, the question still 
remains open (Schmid, 1998). 

Several reports of transcription of Alu sequences either by RNA polymerase II 
or III have appeared (Maraia et al., 1993) but their transcription is often silenced 
by DNA methylation (Liu et al., 1994) and/or nucleosome positioning (Englander 
et al., 1993). Alu sequences contain an internal RNA polymerase III promoter 
(Jelinek and Schmid 1982), a functional retinoic acid response element (Vansant 
and Reynolds, 1995) and a regulatory element that can confer positive or negative 
regulation of transcription upon a variety of promoters in vitro (Brini et al., 1993). 
The presence of the polIII promoter is important since it ensures high expression 
in the germline, a prerequisite for efficient transposition. 

There are at least four different types of Alu sequence which belong to two dis- 
tinct subfamilies (Jurka and Smith, 1988; Britten et al., 1988; Deininger and 
Batzer, 1993). Some types of Alu sequence are human-specific (Batzer et al., 1990; 
Batzer and Deininger, 1991) and these appear to be derived from a number of 
different but closely related master copies or ‘source genes’ (Matera et al., 1990a). 
The vast majority of human Alu sequences appear to be transcriptionally inert. 
Some of the human-specific subfamilies are transpositionally competent and these 
may also be transcriptionally active (Matera et al., 1990b; Sinnett et al., 1992). 

Alu repeats are concentrated in R bands in metaphase chromosomes (Korenberg 
and Rykowski, 1988) whereas these repeats are under-represented in other regions 
(e.g. centric heterochromatin) (Moyzis et al., 1989). R bands are G+C rich, replicate 
their DNA early in S phase and condense late in mitotic prophase. R bands also 
contain the bulk of active gene sequences. One consequence of the nonrandom
distribution pattern of Alu sequences in the genome is that procedures which screen
for these repeats in genomic DNA clones will preferentially locate gene sequences.

Figure 1.7. Structure of the human Alu repeat element compared with the related 7SL
RNA. A 155 bp portion of the 7SL RNA is absent from the Alu sequence as indicated.
Poly(A) stretches are denoted by (A)n.

SINEs have a long evolutionary history, being present in all vertebrate classes 
as well as being found in molluscs (Gilbert and Labuda, 1999). Other interspersed
repeats in vertebrate genomes include various types of DNA transposable element such as Tiggers
and mariners which are characterized by the possession of terminal inverted 
repeats and target site duplications evident as flanking direct repeats (Smit and 
Riggs, 1996; Robertson and Martos, 1997). The Tigger element closely resembles 
pogo, a DNA transposon in Drosophila. Most copies of the mariner family of trans- 
posable elements, of which there are probably more than 1000 in the human 
genome, probably originated between 80 and 50 Myrs ago (Robertson and 
Zumpano, 1997; Robertson and Martos, 1997). In addition, the MIR transposable 
element which has a simple tRNA-like internal polymerase III promoter, was 
amplified to several hundred thousand copies before the adaptive radiation of the 
mammals (Murnane and Morales, 1995; Smit and Riggs, 1995). 


LINE elements. Some 10⁵ copies of long interspersed repeat elements (LINEs)
are present in the human genome (Hwu et al., 1986). LINEs are derived from
polII transcripts and account for perhaps 2-3% of the total DNA complement
(reviewed by Skowronski and Singer, 1986; Singer et al., 1993). LINE elements 
have been found in all mammalian species so far examined and may be traced 
back to an ancestral LINE element that originated before the adaptive radiation 
of mammals (Furano and Usdin, 1995). Human LINE elements vary in size from 
as little as 60 bp up to 6-7 kb; about 95% are truncated at their 5’ ends but they 
mostly appear to contain the same 3’ sequences as well as a poly(A) tail of variable 
length (Figure 1.8). Each individual LINE element differs from the consensus 
sequence (Scott et al., 1987) by ~13% although many exhibit internal deletions 
and rearrangements. The majority of human LINE elements appear to have been 
generated within the last 30 Myrs (Scott et al., 1987). 

Figure 1.8. Structure of a full-length human LINE (L1) element. UTR: Untranslated
region. ORF: Open reading frame. RT: Reverse transcriptase.

LINE elements are confined to the G/Q (Giemsa/Quinacrine) bands of the
euchromatin (Korenberg and Rykowski, 1988). G/Q bands are A+T rich, replicate 
their DNA late during the DNA synthetic period, condense early during mitosis 
and are relatively poor in expressed genes. 

LINEs probably represent processed pseudogene-like copies of reverse transcripts
which have been re-integrated into the genome (Hattori et al., 1986). A
full-length LINE element possesses two open-reading frames, ORF1 (1 kb) and 
ORF2 (4 kb); the latter possesses reverse transcriptase activity (Mathias et al., 
1991) which may serve to mediate the retrotransposition not only of the LINE 
elements themselves (Sassaman et al., 1997; DeBerardinis et al., 1998) but also of 
other retroposons such as Alu sequences (Jurka 1997; Gilbert and Labuda, 1999). 

LINE element transcripts have been found in undifferentiated teratocarcinoma 
cells (Skowronski and Singer, 1985; Skowronski et al., 1988) suggesting that they 
may be expressed early on in mammalian development. A promoter at the 5’ end 
appears to be responsible for the specific expression of LINE elements in terato- 
carcinoma cells (Swergold, 1990). However, only a small subset of all LINE ele- 
ments is capable of being transcribed. 


Endogenous retroviral sequences and transposons. Human transposable ele- 
ments are essentially of two kinds, those that undergo transposition through a DNA 
intermediate (transposons) and those that undergo transposition through an RNA 
intermediate (retroelements). In excess of 10% of the human genome comprises inte- 
grated copies of RNA molecules including retroviruses, retroviral-like DNAs, retro- 
posons and retrotranscripts (reviewed by Cohen and Larsson, 1988; Amariglio and 
Rechavi, 1993; McDonald, 1993; Leib-Mosch and Seifarth, 1996). This represents 
more than 500,000 separate integration events. Nonviral retroposons include the 
Alu sequences and LINE elements discussed above. A number of endogenous retro- 
viral or retroviral-like sequence families have been identified and characterized. 
These include HERV-K (Ono et al., 1987; Goodchild et al., 1995; Mayer et al., 1997a; 
1997b, 1999), RTVL-1 (Maeda and Kim, 1990), RTVL-H (Wilkinson et al., 1993; 
Goodchild et al., 1993), MaLR (Smit, 1993), LTR13 (Liao et al., 1998), and the 
immunoglobulin gene-related human transposon (THE1) (Deka et al., 1988; Fields
et al., 1992; Hakim et al., 1994). The long terminal repeats of the HERV-K family are 
capable of binding human host cell nuclear proteins (Akopov et al., 1998). 

Retroviral sequences have sometimes become integrated into human genes [e.g.
the endogenous retrovirus HRES-1 lies within the coding sequence of the
transaldolase gene (TALDO; 11p15; Banki et al., 1994); see Chapter 9, section 9.4] and
once transposed, they may even be recruited to play a role in the transcriptional
regulation of a gene [e.g. as postulated for the human salivary amylase (AMY1C)
gene (Ting et al., 1992); see Chapter 5, section 5.1.13].


1.1.5 Genes, mutations and disease


A central message of this volume is that the same mutational mechanisms that are 
responsible for disrupting the structure and function of human genes causing 
inherited disease are also responsible for having both created and fashioned these 
same genes over millions of years of evolutionary time. Since Chapters 7-9 are 
devoted to discussing mutational mechanisms in evolution, it is pertinent to
devote at least some space here to summarizing briefly what is known of muta- 
tional mechanisms underlying human genetic disease. 

In general, there are three types of mutation that give rise to an inherited dis- 
ease: (i) mutations which lead to a loss of function, (ii) mutations which lead to a 
gain of function that is deleterious, and (iii) dominant negative mutations that 
adversely affect protein subunit activity or assembly. Characterized gene muta- 
tions causing genetic disease have been found to occur within coding sequences, 
untranslated sequences, promoter and locus control regions, in splice junctions, 
within introns and in polyadenylation sites [reviewed in detail by Cooper and 
Krawczak, 1993; see also the Human Gene Mutation Database at 
http://www.uwcm.ac.uk/uwcm/mg/hgmd0.html, an information resource which 
currently contains details of >18 500 different mutations in >900 different genes; 
Cooper et al., 1998]. Indeed, they may interfere with any stage in the pathway of 
expression from gene to protein product. 

Table 1.2 presents a basic classificatory system of mutation types by reference to
the nature and position of the gene lesion and the stage in the expression pathway
(Figure 1.2) with which it interferes. Mutations are firstly divided on the basis of
whether they result in the reduced synthesis of a gene product (a) or the
synthesis of a structurally/functionally abnormal gene product (b). Mutations are then
secondarily divided into four categories: promoter function, gene structure, RNA
processing and translation. Some gene lesions may be placed into more than one
category. For example, missense mutations with drastic effects on protein
structure and stability or which serve to activate an exonic cryptic splice site can fall
into both categories. Similarly, a missense mutation close to an intron/exon splice
junction could affect mRNA splicing efficiency as well as protein structure. The
effects of specific amino acid substitutions on protein structure are the subject of
several reviews (Pakula and Sauer, 1989; Alber, 1989; Wacey et al., 1994).


Table 1.2. A classification of types of mutation found to cause human single gene defects
through either reduced synthesis of a normal protein or normal synthesis of an abnormal
protein

(a) Reduced synthesis of a normal gene product

Defect in
Promoter function          Binding of positive regulatory protein reduced or abolished
                           Binding of negative regulatory protein increased
Gene structure             Deletions (frameshift)
                           Insertions, duplications, inversions (frameshift)
RNA processing/stability   Mutations in transcriptional initiation site causing failure to
                             initiate transcription
                           Splice junction mutations resulting in exon skipping and/or
                             cryptic splice site utilization
                           Activation of cryptic splice sites
                           Polyadenylation/cleavage signal mutations
                           Mutations in 3’ untranslated region
Translation                Initiation and termination codon mutations
                           Mutations in 5’ untranslated region
                           Nonsense mutations

(b) Synthesis of structurally/functionally abnormal gene product

Gene structure defect resulting in
Shortened gene product             Deletions (in-frame), nonsense mutations
Fusion genes                       Deletions involving two linked genes
Elongated gene product             Insertions, duplications (in-frame)
                                   Termination codon mutations
Defective post-translational       Missense mutations
  modification or processing,
  instability of protein product,
  impaired assembly/secretion,
  altered substrate/cofactor/
  receptor affinity

In the context of human pathology, by far the most frequent genetic lesions in 
the genome are point mutations and deletions. The remainder comprise a mixed 
assortment of insertions, duplications, inversions, sequence amplifications and 
complex rearrangements. A brief review of the different types of known patholog- 
ical lesion will be given. 


Single base-pair substitutions within the coding region. Some 23% of point 
mutations are CG→TG or CG→CA transitions, representing a five-fold higher
frequency for this dinucleotide than that predicted from random expectation 
(Krawczak et al., 1998). This is thought to be due to the hypermutability of the 
methylated dinucleotide CpG; spontaneous deamination of 5-methylcytosine 
(5mC) to thymidine in this doublet gives rise to C→T or G→A transitions
depending upon the strand in which the 5mC is mutated. CpG hypermutability in 
inherited disease implies that the CpG sites in question are methylated in the 
germline and thereby rendered prone to 5mC deamination. 

The spectrum of point mutations occurring outwith CpG dinucleotides is also 
nonrandom (Cooper and Krawczak, 1993; Krawczak et al., 1998). In principle, the 
nonrandomness of the initial mutation event, the nonrandomness of the DNA 
sequences under study, differences in the relative efficiency with which certain 
mutations are repaired, differences in phenotypic effect (and hence selection), or 
a bias in the clinical detection of such variants, may all play a role in determining 
the observed mutational spectrum. 

The majority of single base-pair substitutions causing human genetic disease 
alter the amino acid encoded (missense mutations) but a sizeable proportion 
result in the introduction of a termination codon (nonsense mutations). The like- 
lihood of clinical detection is estimated to be about three times as high for non- 
sense mutations as for missense mutations (Krawczak et al., 1998). Using a 
multidomain molecular model of the human factor IX protein, Wacey et al. (1994) 
have shown that the likelihood that a factor IX gene (F9; Xq27) mutation (causing
hemophilia B) will come to clinical attention is a complex function of the
sequence characteristics of the F9 gene, the nature of the amino acid substitution, 
its precise location and immediate environment within the protein molecule, and 
its resulting effects on the structure and function of the protein. 
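
The distinction between missense and nonsense substitutions follows directly from the standard genetic code. The Python sketch below classifies a single base change within a codon. The function and variable names are hypothetical, and the example uses the arginine codon CGA, in which a CG to TG transition creates the termination codon TGA.

```python
from itertools import product

BASES = "TCAG"
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)
# Standard genetic code (DNA alphabet); '*' marks a termination codon.
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def classify_substitution(codon, pos, alt):
    """Classify the replacement of base `pos` (0-2) of `codon` by `alt`."""
    codon = codon.upper()
    mutant = codon[:pos] + alt.upper() + codon[pos + 1:]
    if mutant == codon:
        raise ValueError("alt is identical to the reference base")
    before, after = CODON_TABLE[codon], CODON_TABLE[mutant]
    if before == after:
        return "synonymous"
    if after == "*":
        return "nonsense"
    if before == "*":
        return "termination codon loss"
    return "missense"


if __name__ == "__main__":
    print(classify_substitution("CGA", 0, "T"))  # Arg -> stop
    print(classify_substitution("CGG", 0, "T"))  # Arg -> Trp
    print(classify_substitution("CTT", 2, "C"))  # Leu -> Leu
```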


Single base-pair substitutions within splice sites. Splicing defects have been 
estimated to make up between 8% and 15% of all single base-pair substitutions 
causing human genetic disease (Krawczak et al., 1992). Mutations appear to occur
disproportionately at the most evolutionarily conserved positions within the 
splice site (Krawczak et al., 1992). This is due to a detection bias resulting from 
the relative phenotypic severity of these lesions. Phenotypic consequences of 
splice site mutations include exon skipping and cryptic splice site utilization. 


Single base-pair substitutions within promoter regions. A wide variety of 
mutations are now known which occur within gene promoter regions (Cooper and 
Krawczak, 1993). These lesions disrupt the normal processes of gene activation and 
transcriptional initiation and serve to decrease (or rather less often, increase) the 
level of mRNA/protein product synthesized. What these lesions have in common is 
their ability to alter or abolish the binding capacity of cis-acting DNA sequence 
motifs for the trans-acting protein factors that normally interact with them. 


Gross gene deletions. Gross gene deletions may arise through a number of dif- 
ferent recombinational mechanisms including homologous unequal recombina- 
tion (occurring either between related gene sequences or repetitive elements). Alu 
sequences have been noted to flank deletion breakpoints in a considerable num- 
ber of human genetic conditions and may represent hotspots for gene deletions 
(Cooper and Krawczak, 1993). 


Microdeletions. Deletion breakpoint junctions flanking short (<20 bp) human 
gene deletions are non-random both at the nucleotide and dinucleotide levels, an 
observation consistent with a sequence-directed mechanism of mutagenesis 
(Cooper and Krawczak, 1993). Direct repeats flanking the deleted sequence are a 
common finding, consistent with a model of slipped mispairing at the replication 
fork. Two specific types of sequence have been found at high frequency in the vicin- 
ity of short gene deletions: polypyrimidine runs of at least 5 bp (YYYYY) and a 
‘deletion hotspot consensus sequence’ (TGRRKM) (Cooper and Krawczak, 1993). 
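
The direct repeats expected under slipped mispairing can be looked for computationally. The Python sketch below measures the length of the direct repeat shared between the start of a deleted segment and the sequence immediately following it, and checks the neighbourhood of the deletion for the two motifs named above. All names, the 0-based coordinates and the window size are assumptions for illustration, not the method used in the cited analysis.

```python
import re

POLYPYRIMIDINE = re.compile(r"[CT]{5,}")               # YYYYY or longer
HOTSPOT_CONSENSUS = re.compile(r"TG[AG][AG][GT][AC]")  # TGRRKM


def direct_repeat_length(seq, start, length):
    """Length of the direct repeat flanking a deletion of seq[start:start+length].

    Counts how many bases at the start of the deleted segment recur
    immediately after its end.
    """
    seq = seq.upper()
    end = start + length
    n = 0
    while end + n < len(seq) and n < length and seq[start + n] == seq[end + n]:
        n += 1
    return n


def motifs_near(seq, start, length, window=10):
    """Report which of the two motifs occur within `window` bp of the deletion."""
    region = seq.upper()[max(0, start - window): start + length + window]
    return {
        "polypyrimidine_run": bool(POLYPYRIMIDINE.search(region)),
        "hotspot_consensus": bool(HOTSPOT_CONSENSUS.search(region)),
    }


if __name__ == "__main__":
    # Made-up sequence: GCTA occurs at positions 2-5 and again at 11-14.
    seq = "AAGCTACCTCCGCTAGTT"
    print(direct_repeat_length(seq, 2, 9))
    print(motifs_near(seq, 2, 9))
```

In the made-up example, deleting nine bases from position 2 removes one copy of the repeat together with the intervening sequence, which is the signature that slipped mispairing would leave.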


Insertions. That insertional mutagenesis might be as intrinsically non-random as
point mutations and gene deletions was strongly suggested by the findings of 
Fearon et al. (1990), who reported 10 independent examples of DNA insertion 
within the same 170 bp intronic region of the DCC (18q21) gene (a locus which 
has been proposed to play an important role in human colorectal neoplasia). 
Insertional mutations involving the introduction of <10 bp DNA sequence into a 
gene coding region are (i) nonrandom and appear to be highly dependent upon 
the local DNA sequence context and (ii) may be explained by those mechanisms 
held to be responsible for gene deletions (Cooper and Krawczak, 1993). 


Inversions. Inversions are a highly unusual mutational mechanism causing 
human genetic disease. The best known example is that found recently in the fac- 
tor VIII (F8C) gene causing hemophilia A: this rearrangement occurs in about 
40% of severely affected patients and recurs at high frequency (Lakich et al., 1993; 
Naylor et al., 1993). The mechanism responsible is thought to be homologous 
intrachromosomal recombination between a gene (F8A) located in intron 22 of 
the F8C gene and two additional homologues of the F8A gene situated 500 kb
upstream of the F8C gene. 


Expansion of unstable repeat sequences. A recently recognized mutational
mechanism involves the instability of certain specific trinucleotide repeat 
sequences (reviewed by Timchenko and Caskey, 1996). This mechanism was first 
reported as a cause of the fragile X mental retardation syndrome. The brain-expressed
FMR1 gene responsible was found to contain an unusual (CGG)n repeat
which exhibited copy number variation of between 6 and 54 in normal healthy 
controls, between 52 and >200 in phenotypically normal transmitting males (the 
‘premutation’) and between 300 and >1000 in affected males (the ‘full mutation’). 
Expansion of premutations to full mutations occurs only during female meiotic 
transmission whilst the probability of repeat expansion correlates with repeat 
copy number, consistent with a mechanism of slipped mispairing during replica- 
tion. Expansion of a sequence can thus itself lead to further expansion, a process 
termed ‘dynamic mutation’ by Richards et al. (1992). The discovery of this novel 
mutational mechanism soon led to the recognition that the expansion of unstable 
repeats is responsible for a number of other human inherited diseases, almost all 
neuromuscular (see Chapter 8, section 8.9.1). 
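
The copy-number ranges quoted above overlap at their boundary (52-54 repeats), so any classifier built on them has to be able to report ambiguity. The following Python sketch counts the longest uninterrupted (CGG)n run in a sequence and lists the categories with which it is compatible. The thresholds are simply those given in the text, the cap placed on the open-ended premutation range is an assumption, and the function names are hypothetical. It is an illustration, not a diagnostic rule.

```python
import re

CGG_RUN = re.compile(r"(?:CGG)+")


def longest_cgg_run(seq):
    """Copy number of the longest uninterrupted (CGG)n run in `seq`."""
    runs = CGG_RUN.findall(seq.upper())
    return max((len(run) // 3 for run in runs), default=0)


def compatible_categories(copies):
    """Return the categories whose quoted copy-number range includes `copies`."""
    categories = []
    if 6 <= copies <= 54:
        categories.append("normal")
    # Assumption: the open-ended '>200' premutation range is capped at 299 here.
    if 52 <= copies < 300:
        categories.append("premutation")
    if copies >= 300:
        categories.append("full mutation")
    return categories


if __name__ == "__main__":
    allele = "AT" + "CGG" * 30 + "TT" + "CGG" * 4  # made-up allele
    copies = longest_cgg_run(allele)
    print(copies, compatible_categories(copies))
```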

The triplet repeats (CGG)n and (CAG)n are very abundant in the human 
genome (Stallings, 1994; Han et al., 1994). A considerable number of human 
genes, both ubiquitously expressed and tissue-specific, have now been identified 
as containing such triplet repeats (Riggins et al., 1992; Karlin and Burge, 1996). 
Many of these repeats are highly polymorphic and may thus represent examples 
of more subtle triplet repeat expansions. Whether specific polymorphic alleles are 
associated with any particular phenotype, however, remains to be seen. 

This chapter has attempted to provide the reader with a brief introduction to what 
is known of structure—function relationships in the human genome. Specific DNA 
sequences have clearly evolved for different cellular functions. Some sequences rep- 
resent gene coding regions that contain the genetic information necessary to direct 
the synthesis of proteins. Other sequences in gene promoter or untranslated regions 
serve to direct appropriate gene expression or are involved in mRNA processing, sta- 
bility and nucleocytoplasmic transport. Some highly repetitive sequences may play 
an important role in chromosome architecture whereas others may almost be com- 
mensally parasitic in that they cohabit in the genome having increased in copy num- 
ber under their own mutational momentum. Some sequences are modified by DNA 
methylation which itself may influence function. Other sequences, by their very 
nature, are more mutable than others and may be capable of rapid change. Whatever 
their cellular role, many DNA sequences are able to interact with proteins or mRNA 
molecules in order to effect their function. It follows that the imminent analysis of 
the human genome sequence should reveal the existence of new sequence codes (and 
perhaps codes within codes) to add to that first elucidated in the early 1960s. 


2


Evolution of the human 
genome 


2.1 Ancient genome duplications at the dawn of vertebrate 
evolution 


2.1.1 Evidence for an ancient genome duplication 


In evolution, novel genes have arisen either by whole genome duplication or by 
regionally localized duplication events (see Chapter 9, section 9.5). Both mecha- 
nisms give rise to paralogous genes, genes that occur within the same species and 
which have a common ancestor. Thus, members of multigene families and super- 
families are paralogous and are distinct from orthologous genes which are found in 
different species and have diverged from their common ancestor over evolution- 
ary time. In practice, the consequences of whole genome duplication and region- 
ally localized gene duplication are often very hard to distinguish. Skrabanek and 
Wolfe (1998) elegantly explained the problem by analogy: 


‘Take four, or maybe eight, decks of 52 playing cards. Shuffle them all together 
and then throw some cards away. Pick 20 cards at random and drop the rest on 
the floor. Give the 20 cards to some evolutionary biologists and ask them to 
figure out what you’ve done. For encouragement, tell them they can have the 
cards on the floor in 2005 (the estimated date of completion of the human 
genome sequence)’. 


Ohno (1970) first proposed that the vertebrate genome had evolved to its present 
size through an ancient tetraploidization event. Evidence supporting this hypoth- 
esis now comes from a variety of different sources. Comparative nuclear DNA 
measurements in higher organisms are compatible with the idea of successive 
genome doublings (Ohno, 1970; Sparrow and Nauman, 1976). So are cytogenetic 
data, since the human karyotype can be divided into similar pairs of chromosomes 
on the basis of structure and banding patterns, viz. chromosomes 1 and 2, 4 
and 5, 7 and 8, 11 and 12, 14 and 15, 16 and 17, 19 and 20, 21 and 22 (Comings, 
1972; McKusick, 1980). The postulated ancient tetraploidization event also 
appears to be reflected in the genetic information content of these chromosome 
pairs. This is exemplified by the distribution of the insulin and RAS oncogene 
family members in the human genome: the insulin (INS), insulin-like growth 
factor 2 (IGF2) and Harvey RAS (HRAS) genes are located on chromosome 11p 
whilst the insulin-like growth factor 1 (IGF1) and Kirsten RAS (KRAS2) genes 
are located on chromosome 12 (Hoppener et al., 1985). 

That the genome duplications occurred prior to the adaptive radiation of the 
vertebrates is evidenced by the similar gene number exhibited by both fish and 
mammals (Elgar et al., 1996), the survival of syntenic linkage groups over quite 
long periods of evolutionary time (Lundin 1993; Trower et al., 1996) and the 
tetralogy exhibited by many vertebrate loci. This tetralogy is evident from the 
observation that the basic set of ~15 000 genes found in all primitive metazoans 
varies only moderately from Caenorhabditis elegans (Ahringer, 1997) to Drosophila 
(Miklos and Rubin, 1996) to Ciona intestinalis, a tunicate (Simmen et al., 1998) but 
this gene complement is approximately four-fold smaller than the total number of 
genes in vertebrate genomes. At the level of the individual gene, there are numer- 
ous instances where an invertebrate (Drosophila) gene has up to four related genes 
(paralogues) in vertebrates, implying two rounds of genome duplication (Spring, 
1997). These paralogues represent quadruplicated loci on different chromosomes 
that are more similar to each other than they are to members of other tetralogous 
groups. In humans, these so-called ‘tetralogues’ include the HOX genes (HOXA, 
7p14-p15; HOXB, 17q21-q22; HOXC, 12q12-q13; HOXD, 2q31), the epidermal 
growth factor receptor genes (EGFR, 7p12; ERBB2, 17q11.2-q12; ERBB3, 
12q13; ERBB4, 2q34), the Jak family of tyrosine kinases (JAK1, 1p31.3-32.3; 
JAK2, 9p24; JAK3, 19p12-p13; TYK2, 19p13.2), the MADS box enhancing factors 
(MEF2A, 15q26; MEF2B, 19p12; MEF2C, 5q14; MEF2D, 1q12-q23), the Src 
family of nonreceptor tyrosine kinases (SRC, 20q11.2; YES1, 18p11.22-p11.31; 
FGR, 1p36.1-p36.2; FYN, 6q21) and the syndecans (SDC1, 2p; SDC2, 8q22-q23; 
SDC3, 1p32; SDC4, 20q12-q13) (Spring 1997; Pebusque et al., 1998). Many fur- 
ther examples of triplicated loci occur in the human genome (e.g. Katsanis et al., 
1996; Plummer and Meisler, 1999). In these cases, it may be that one paralogue 
has been lost or alternatively, still remains to be characterized (Spring, 1997). 
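
The chromosomal assignments listed above can be tabulated to see which gene families occupy the same set of chromosomes, a pattern that would be expected if whole chromosomal regions had been duplicated together. The following Python sketch is illustrative only: the locations are transcribed from the text (using standard gene symbols), and the helper names `chromosome_of`, `chromosome_sets` and `shared_chromosome_sets` are hypothetical.

```python
import re
from itertools import combinations

# Cytogenetic locations as quoted in the text for six 'tetralogous' families.
TETRALOGUES = {
    "HOX": {"HOXA": "7p14-p15", "HOXB": "17q21-q22",
            "HOXC": "12q12-q13", "HOXD": "2q31"},
    "EGFR": {"EGFR": "7p12", "ERBB2": "17q11.2-q12",
             "ERBB3": "12q13", "ERBB4": "2q34"},
    "JAK": {"JAK1": "1p31.3-32.3", "JAK2": "9p24",
            "JAK3": "19p12-p13", "TYK2": "19p13.2"},
    "MEF2": {"MEF2A": "15q26", "MEF2B": "19p12",
             "MEF2C": "5q14", "MEF2D": "1q12-q23"},
    "SRC": {"SRC": "20q11.2", "YES1": "18p11.22-p11.31",
            "FGR": "1p36.1-p36.2", "FYN": "6q21"},
    "SDC": {"SDC1": "2p", "SDC2": "8q22-q23",
            "SDC3": "1p32", "SDC4": "20q12-q13"},
}


def chromosome_of(band):
    """Return the chromosome part of a cytogenetic location such as '7p14-p15'."""
    match = re.match(r"(\d+|X|Y)[pq]", band)
    if match is None:
        raise ValueError(f"unrecognized cytogenetic location: {band!r}")
    return match.group(1)


def chromosome_sets(families):
    """Map each family name to the set of chromosomes carrying its members."""
    return {
        name: frozenset(chromosome_of(band) for band in members.values())
        for name, members in families.items()
    }


def shared_chromosome_sets(families):
    """List pairs of families whose members lie on exactly the same chromosomes."""
    sets = chromosome_sets(families)
    return sorted(
        (a, b) for a, b in combinations(sorted(sets), 2) if sets[a] == sets[b]
    )


if __name__ == "__main__":
    for name, chromosomes in chromosome_sets(TETRALOGUES).items():
        print(name, sorted(chromosomes, key=int))
    print(shared_chromosome_sets(TETRALOGUES))
```

On these data the HOX clusters and the EGFR family members fall on the same four chromosomes (2, 7, 12 and 17). Such co-location is compatible with the duplication of whole chromosomal regions, although it does not by itself distinguish tetraploidization from regional duplication.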

The results of studies of the genes encoding the homeobox proteins, insulin- 
like growth factor genes and high mobility group proteins (Chan et al., 1990; 
Garcia-Fernandez and Holland, 1994; Holland, 1991; Pendleton et al., 1993; 
Sharman et al., 1996; see Chapter 4, sections 4.2.1, Homeobox genes and 4.2.3, 
Insulin and insulin-like growth factor genes) in the primitive chordates Amphioxus 
and Ciona, and a jawless vertebrate Lampetra fluviatilis (lamprey), are consistent 
with the occurrence of one genome duplication in the common ancestor of all 
jawed and jawless vertebrates after the lineage leading to Amphioxus had diverged, 
and a second genome duplication occurring in the common ancestor of jawed ver- 
tebrates after their divergence from the jawless vertebrates (Sidow, 1996). Studies 
of HOX gene number suggest that the duplication events are likely to have 
occurred before the radiation of the teleosts (Amores et al., 1998). 

Gene number does not however automatically distinguish between tandem 
duplications and polyploidization events. Postlethwait et al. (1998) mapped 144 
zebrafish (Danio rerio) genes; comparison of the resulting map with their mam- 
malian counterparts led to the identification of orthologous chromosome segments 
for at least three chromosome paralogy groups in zebrafish and mammals. This 
finding is consistent with the hypothesis that these segments were duplicated prior 
to the divergence of zebrafish and mammals. The presence of more than two copies 
of each paralogous chromosomal segment is suggestive of at least two rounds of 
duplication, which would have occurred after the divergence of the cephalochordates 
and cranial chordates but before the divergence of the ray-finned and lobe-finned 
fishes, which is thought to have occurred about 420 Myrs ago. 

Several extensive regions of paralogy have been identified in the human genome 
which have been claimed to result from ancient tetraploidization events. The 13 
groups of paralogous genes found on chromosomes 4 and 5 (Table 2.1) provide one 
example (Lundin, 1993). Lundin (1993) identified several other possible examples 
of paralogous pairs or groups of genes on different human chromosomes: (i) parts 
of chromosomes 2, 7, 12, 14, and 17, (ii) parts of chromosomes 8, 10, and 16, and 
(iii) parts of chromosomes 1, 11, 12, 15, and 19 (Table 2.2). Although the extensive 
paralogy noted between chromosomes 11 and 12 is explicable by a model of chro- 
mosome duplication resulting from tetraploidization, there are some discrepancies 
in the locations of genes on these chromosomes. These can however be accounted 
for by the occurrence of a pericentric inversion on chromosome 12. 

Paralogy may be explained by mechanisms other than tetraploidization. Indeed, 
some paralogous gene loci are explicable by regional duplication (see Chapter 9, sec- 
tion 9.5); Table 2.3 lists those identified on human chromosome 1. The relative 
importance of regional duplication/translocation as compared to tetraploidization is 
unclear and ambiguity even extends to individual cases. Thus, the relative locations 
of the tyrosine hydroxylase (TH; 11p15.5), tryptophan hydroxylase (TPH; 11p14.3- 
p15.1) and phenylalanine hydroxylase (PAH; 12q22-q24) genes have been explained 
in terms of both mechanisms (Craig et al., 1986; Ledley et al., 1987; Lundin, 1993). 

The two highly related regions on the proximal and distal long arms of human 
chromosome 21 (21q22.1 and 21q11.2) appear to have arisen as a result of an intra- 
chromosomal duplication of >200 kb (Dutriaux et al., 1994). This duplication is 
thought to have arisen between 15 and 30 Myrs ago after the separation of the 
orangutan from the other great apes (Orti et al., 1998). By contrast, the origin of the 
paralogous 2-20 Mb segments on human chromosomes 1, 6, and 9 (Banyer et al., 
1998; Endo et al., 1998; Katsanis et al., 1996) is unclear. Regardless of the mecha- 
nism, at least two intra-chromosomal duplications must have occurred resulting in 
the triplication of a series of genes, for example the retinoid X receptor genes 
RXRG (1q22-q23), RXRB (6p21.3) and RXRA (9q34.31), the pre-B cell leukemia 
transcription factor genes PBX1 (1q23), PBX2 (6p21.3), and PBX3 (9q34), and the 
tenascin genes TNR (1q25-q31), TNXA (6p21.3), and HXB (9q32-q34) (Katsanis et 
al., 1996). Interestingly, Alu- and LINE-dense clusters flank the boundaries of the 
6p21.3 segment, a finding which may be significant in view of the recombinogenic 
potential of these sequence elements. In this context, it may be significant that a 
sequence related to the pseudoautosomal boundary of the human sex chromo- 
somes (see Section 2.3.4) has also been noted at the centromeric boundary of the 
6p21.3 segment (Fukagawa et al., 1996). 


2.1.2 Consequences of genome duplications for gene evolution 


Table 2.1. Possible paralogies between parts of human chromosomes 4 and 5 (after Lundin et al., 1993) 

4 5 
FGFR3 FGFR4 
HTR1A 
ADRB2 
ADRA2C ADRA1B 
DRD5 DRD1 
QDPR DHFR 
GABRA2 GABRA1 
GABRB1 
STATH SPARC 
KIT PDGFRB 
PDGFRA CSF1R 
AREG C7 
EGF C9 
AGA HEXB 
FGF5 
FGF2 FGF1 
IF F12 
F11 GZMA 
KLK3 
CSF2 
IL2 IL3 
IL4 
IL5 
IL9 
CSF1 
MLR GRL 
ANX3 
ANX5 ANX6 


In principle, the genetic redundancy created by a genome duplication would have 
allowed evolutionary experimentation, in that while one gene copy continued to 
function as before, the other was freed to acquire mutations, irrespective of 
whether they were adaptive or inactivating (Ohta, 1989; 1991). If the newly dupli- 
cated gene acquired mutations that modified either the expression pattern of the 
encoded gene or the function of the encoded protein in an advantageous way, the 
novel allele could have become fixed in the population. Ohno (1970) expressed 
this idea rather elegantly: 


An escape from the ruthless pressure of natural selection is provided by the 
mechanism of gene duplication. By duplication, a redundant copy of a locus is 
created. Natural selection often ignores such a redundant copy, and, while 
being ignored, it accumulates formerly forbidden mutations and is reborn as a 
new gene locus with a hitherto non-existent function. Thus, gene duplication 
emerges as the major force of evolution. 


In this context, evidence for positive selection (Chapter 7, section 7.1.3) has come 
from the observation of accelerated evolution in some genes subsequent to gene 
Table 2.2. Possible paralogies between parts of human chromosomes 1, 11, 12, 15 and 19 
(after Lundin et al., 1993) 


1 11 12 15 19 
RYR2 CACNA1C RYR1 
BCAT1 BCAT2 
ENO1 ENO2 
GDH SORD 
EBVS1 EBVM1 
CEAL1 
CEA 
PTPRF NCAM1 MAG 
NCA 
PSG1-13 
TNFR2 TNFR1 
SLC2A1 SLC2A3 
SLC2A5 
GOT2L1 GOT2L3 
GOT2L2 
GNAI3 GNA12L 
GNAT2 
GNB1 GNB3 
LMO1 LMO3 
LMO2 
MYCL1 MYF5 LYL1 
MYOG MYOD1 MYF6 TCF3 
CHRM3 CHRM1,4 CHRM5 
FGF3 FGF6 
FGF4 
A2M 
PZP 
ESA4 ELA1 C3 
F2 C1S KLK1 
C1R KLK2 
LIPC LIPE 
TPI1 MPI GPI 
LDHAL2 LDHA, LDHC LDHB 
NRAS HRAS KRAS2 
RAP1A RAP1B RRAS 
RAB3B RAB3A 
RAB4 
CALCA IAPP 
CALCB 
PTH PTHLH 
FGR 
LCK SEA FES 
ABL2 
JUN JUNB 


JUND 
INSRR ERBB3 IGF1R INSR 
NTRK1 LTK TYK2 
C8A, C8B LRP1 THBS1 LDLR 
GSTM1 GSTP1 MGST1 
C1QA, C1QB 
COL11A1 COL2A1 
COL8A2 
H1F2 H1F4 
PGR HMR 
RARG CRABP1 
VDR 
HKR3 WT1 GLI HKR1 
HKR2 
PEPC PEPB ANPEP PEPD 
KCNC4 KCNA1,2,5 KCNA7 
KCNC2 KCNC3 
ACADM ACADS IVD 
TH, TPH PAH 
NGFB INS 
IGF2 IGF1 
INSL2 
TRV2 TRV3 
FRV1 FRV3 
PLA2G2A PLA2G1B 
PLA2G2C 
TSHB FSHB LHB 
CGB 
ACP2 ACP5 
PKLR PKM2 
ATP1A1 ATP1A3 
ATP1A2 
ATP1B1 
ATP1AL2 ATP2A2 
ATP2B2 ATP2B1 
TRAP2 TRA1 TRAP1 
CD48 THY1 
CD3D A1BG 
CD3E 
CD3Z CD3G 
CD1A-E FCER1B 
FCER1A FCER2 
FCER1G 
FCGR2A, 2B, 
2C, 3A, 3B CD4 
PIGR 
APOA4 APOC1 
APOA2 APOA1 APOC2 
APOC3 APOE 
FUT4 FUT1 
SIAT4C FUT2 
MANA1 MANB 
CKMT1 CKM 
GANAB GANC 
CAPN2 CAPN1 
CAPN3 CAPN4 
MUC1 MUC2, 
MUC5B 
MUC5AC 
AT3 C1NH 
AGT 
PFKM PFKX 
FDPSL1 CHR39B 
ACTA1 ACTC 
POU2F1 POU2F2 
TNNI1 TNNT1 
SNRPE SNRPN SNRPA 
SNRP70 
TGFB2 AMH 
TGFB1 
CTSE CTSD CTSH 
REN PGA3-5 


duplication (Burger et al., 1994; Hill and Hastie 1987; Kurihara et al., 1997; Ohta 
1994; Wallis 1993). In such studies, positive selection is implicated in the process 
of evolutionary change by the observation of a higher frequency of non-synony- 
mous over synonymous substitutions (Hughes and Nei, 1988). Changes in the 
expression patterns of the duplicated genes are sometimes also apparent as in the 
neuronal and muscle expressed genes of the nicotinic acetylcholine receptor family 
(Le Novère and Changeux, 1995). In principle, gene conversion (see Chapter 9, 
section 9.5) may either promote diversification of proteins encoded by duplicated 
genes (Ohta and Basten, 1992) or promote homogenization (Sidow and Thomas, 
1994) in which case the molecular evolutionary record is automatically erased. In 
practice, however, at least for the HLA and immunoglobulin genes, gene conver- 
sion is notable more by its absence: instead new genes tend to be created by a 
‘birth-and-death’ process of duplication and deletion (Nei et al., 1997). 
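
The comparison of non-synonymous and synonymous substitutions mentioned above is commonly summarized as a ratio of per-site rates. The sketch below is a deliberately simplified illustration and is not the method of Hughes and Nei (1988): it assumes that the numbers of differences and of sites in each class have already been counted, applies a Jukes-Cantor correction to each proportion, and uses hypothetical function names and made-up example counts.

```python
import math


def jukes_cantor(p):
    """Correct an observed proportion of differing sites for multiple substitutions."""
    if not 0.0 <= p < 0.75:
        raise ValueError("proportion of differences must lie in [0, 0.75)")
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)


def dn_ds_ratio(nonsyn_diffs, nonsyn_sites, syn_diffs, syn_sites):
    """Ratio of non-synonymous to synonymous substitutions per site."""
    d_n = jukes_cantor(nonsyn_diffs / nonsyn_sites)
    d_s = jukes_cantor(syn_diffs / syn_sites)
    if d_s == 0.0:
        raise ValueError("no synonymous change observed; the ratio is undefined")
    return d_n / d_s


def interpret(ratio):
    """Crude verbal reading of the ratio; real analyses use statistical tests."""
    if ratio > 1.0:
        return "excess of non-synonymous change (compatible with positive selection)"
    if ratio < 1.0:
        return "deficit of non-synonymous change (compatible with purifying selection)"
    return "no difference between the two classes"


if __name__ == "__main__":
    # Hypothetical counts for a pair of duplicated genes.
    ratio = dn_ds_ratio(nonsyn_diffs=60, nonsyn_sites=300,
                        syn_diffs=10, syn_sites=100)
    print(round(ratio, 2), interpret(ratio))
```

A ratio above one is only suggestive; in practice the excess must be shown to be statistically significant before positive selection is inferred.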

A more likely scenario, however, is that the duplicated gene rapidly acquires 
inactivating mutations and becomes a pseudogene. Indeed, assuming that a gene 
duplication is not selectively disadvantageous, the duplicated genes can survive 
in the genome for quite long periods (Clark, 1994; Loomis and Gilpin, 1986; 
Nowak et al., 1997). Arguably the best available model system to assess whether 
duplicated genes will be retained or inactivated over evolutionary time is yeast 
(Saccharomyces cerevisiae), the organism with the best characterized genome 
Table 2.3. Locations of human gene loci indicating large regional duplications of chromosome 1 (after Lundin et al., 1993) 

Symbol        Description                                 Location 
TRN           tRNA, asparagine                            1p36 
TRNL          tRNA, asparagine-like                       1q12-q22 
TRE           tRNA, glutamic acid                         1p36 
TREL1         tRNA, glutamic acid-like 1                  1q21-q22 
RNU1          Small nuclear U1 RNA                        1p36 
RNU1P1-4      Small nuclear U1 RNA pseudogenes 1-4        1q12-q22 
FGR           Feline sarcoma oncogene                     1p36 
LCK           Lymphocyte protein tyrosine kinase          1p32-p35 
ABLL          Abelson murine leukemia oncogene-like       1q24-q25 

C8A, C8B      Complement component 8, α/β chains          1p22-p36 
C1QA, C1QB    Complement component 1q, α/β chains         1p 
C4BPA, C4BPB  Complement component 4 binding protein      1q32 
CR1, CR2      Complement component receptors 1/2          1q32 
AK2           Adenylate kinase 2                          1p34 
GUK1          Guanylate kinase 1                          1q32-q42 
GOT2L1        Glutamic-oxaloacetic transaminase 2-like 1  1p32-p33 
GOT2L2        Glutamic-oxaloacetic transaminase 2-like 2  1q25-q31 
RAB3B         Member RAS oncogene family                  1p31-p32 
NRAS          Neuroblastoma RAS oncogene                  1p13 
RAP1A         Member RAS oncogene family                  1p12-p13 
RAB4          Member RAS oncogene family                  1q42-q43 
FTHL1         Ferritin, heavy polypeptide-like 1          1p22-p31 
FTHL2         Ferritin, heavy polypeptide-like 2          1q32-q42 
CD58          Lymphocyte function-associated antigen      1p13 
DAF           Decay accelerating factor for complement    1q32 
ATP1A1        ATPase, Na+/K+, α1 polypeptide              1p13 
ATP1A2        ATPase, Na+/K+, α2 polypeptide              1q21-q23 
ATP1B1        ATPase, Na+/K+, β polypeptide               1q22-q25 


duplication. Sequencing data from the 12 Mb yeast genome (Mewes et al., 1997) 
are consistent with the occurrence of a whole genome duplication that occurred 
~100 Myrs ago, after the divergence of S. cerevisiae from Kluyveromyces (Wolfe 
and Shields 1997). Most duplicated genes were subsequently deleted; only 13% 
of yeast proteins are now represented as homologous pairs encoded by homolo- 
gous genes. Of these homologous gene pairs, only a few possess functions that 
have clearly diverged under the influence of selection viz. the mitochondrial and 
peroxisomal isozymes of citrate synthase (CIT1 and CIT2), RAS1 and RAS2, the 
transcription factors ACE2 and SWI5, the phosphatidylinositol kinases TOR1 
and TOR2, and the myosins MYO3 and MYO5. Conclusions drawn from yeast 
may not however be applicable to vertebrates and in this phylum, many more 
genes may have survived the aftermath of duplication to acquire new functions. 
Thus, Nadeau and Sankoff (1997) analyzed the frequency distribution of family 
size for gene families present in humans and mice and which arose putatively by 
genome duplication early in vertebrate evolution. They concluded that dupli- 
cated genes were as likely to have survived and acquired a novel function as to 
have been lost through the acquisition of inactivating mutations. In agreement 
with these findings, studies of fish (Bailey et al., 1978) and Xenopus (Hughes and 
Hughes, 1983) have both suggested that about 50% of newly duplicated genes are 
retained after tetraploidization. 


2.2 Mammalian genome evolution 


The best estimates of divergence times for the various mammalian orders and the 
other major vertebrate lineages have come from the use of extant gene sequences 
to calibrate a ‘molecular clock’ of vertebrate evolution (Kumar and Hedges, 1998). 
Gene-specific evolutionary rates often vary quite dramatically (see Chapter 7, sec- 
tion 7.1.3) but the use of multiple genes to derive mean divergence times should 
yield more accurate and reliable estimates. Kumar and Hedges (1998) therefore 
employed 658 genes from 207 vertebrate species to derive a molecular timescale 
for vertebrate evolution (Figure 2.1). The divergence times corresponded well to 
previous estimates based upon the fossil record. Thus the calculated divergence 
time for the jawless fish (Agnatha) was 564 Myrs ago in the Precambrian era. 
Interestingly, the molecular data indicated that at least five major lineages of pla- 
cental mammals [Edentata (armadillos, anteaters and sloths), Hystricognathi 
(porcupines and guinea pigs), Sciurognathi (squirrels), Paenungulata (hyraxes) 
and Ferungulata (carnivores)] could have arisen in the early to middle Cretaceous 
between 130 and 90 Myrs ago. This represents an important revision of previous 
estimates of the timing of the adaptive radiation of the mammals [estimated by 
Novacek (1992) to have occurred between 100 Myrs and 65 Myrs ago]. Since this 
now appears to have predated the Cretaceous/Tertiary extinction of the dinosaurs 
65 Myrs ago, the adaptive radiation of the mammals could not have been simply a 
consequence of the filling of niches vacated by the departing super-lizards. Other 
factors such as climatic change and the continental breakup must also have played 
a role (Hedges et al., 1996). This question notwithstanding, the adaptive radiation 
of the mammals has been very successful, resulting in the emergence of >4600 liv- 
ing species that occupy a very diverse range of habitats and environments. 
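
Averaging divergence-time estimates across many genes, as described above, can be sketched as a mean with its standard error. The per-gene values used below are hypothetical placeholders chosen purely for illustration; they are not data from Kumar and Hedges (1998), and the function name is an assumption.

```python
import math
import statistics


def mean_divergence_time(estimates_myr):
    """Return (mean, standard error of the mean) for per-gene estimates in Myrs."""
    if len(estimates_myr) < 2:
        raise ValueError("at least two gene-specific estimates are needed")
    mean = statistics.mean(estimates_myr)
    sem = statistics.stdev(estimates_myr) / math.sqrt(len(estimates_myr))
    return mean, sem


if __name__ == "__main__":
    # Hypothetical gene-specific estimates (Myrs) for a single divergence event.
    per_gene = [90.0, 100.0, 110.0]
    mean, sem = mean_divergence_time(per_gene)
    print(f"{mean:.1f} +/- {sem:.1f} Myrs ({len(per_gene)} genes)")
```

Because the standard error shrinks with the square root of the number of genes, a fast-evolving or slow-evolving gene has less influence on the final estimate as more genes are included.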

It has been suggested that the rate of mammalian speciation may have been influ- 
enced by the rate of karyotypic change (Bush et al., 1977; Qumsiyeh, 1994; Wilson et 
al., 1977). However, mammalian genomes still contain significant regions of genetic 
linkage that have been conserved through evolutionary time. Indeed, appreciation of 
the phenomenon of linkage group conservation between different mammalian 
species has led to the identification and chromosomal localization of novel genes in 
the human genome. Such studies also promise to aid greatly our understanding of 
mammalian genome evolution by, for example, revealing chromosomal inversions or 
translocations, duplications of genes or gene regions, or more subtle lesions that may 
have led to the functional divergence and specialization of proteins. 

Genomic mapping projects are underway for a variety of mammals including 
the mouse, rat, cow, sheep, and pig but by far the most data are available for the 
mouse (Edwards et al., 1994; Eppig, 1996; Nadeau et al., 1995; Nadeau and 
Sankoff, 1998; O’Brien et al., 1993). Comparative mapping data for a range of 
mammals including a number of primates are available at 
http://www.informatics.jax.org/homology.html. The comparative analysis of the human and murine 
Figure 2.1. A molecular timescale for vertebrate evolution (after Kumar and Hedges, 
1998). All times indicate Myrs separating humans (or the largest sister group containing 
humans) and the group shown, except when the comparative groups are separated by a 
slash (/). Time estimates are shown with ± s.e.m. and the number of genes used is given in 
parentheses. Three groups of mammalian orders are Archonta (Primates, Scandentia, 
Dermoptera, Chiroptera and Lagomorpha), Ferungulata (Carnivora, Cetartiodactyla and 
Perissodactyla) and Paenungulata (Hyracoidea, Proboscidea, Sirenia). 
genomes is especially important because the mouse represents a genetic system 
with a large number of spontaneous mutants (some of which are relevant to the 
study of human genetic disease), numerous inbred strains and enormous poten- 
tial for the application of transgenic technology (Rubin and Barsh, 1996). 

The latest comparative genetic map for human and mouse contains nearly 1800 
genes mapped to over 200 different syntenic chromosomal groups [DeBry and 
Seldin, 1996; http://www.ncbi.nlm.nih.gov/Homology/]. Despite a timespan of 
between 100 (Kumar and Hedges, 1998) and 80 Myrs (Collins and Jukes, 1994) 
since the divergence of the two species, some syntenic groups have maintained 
gene content, order and spacing over considerable genetic and physical distances, 
for example the 1q21-q23 region of human and mouse chromosomes 1 (Oakey et 
al., 1992) and a portion of mouse chromosome 2 with the entire human chromo- 
some 20 (DeBry and Seldin, 1996). Other syntenic groups, by contrast, have accu- 
mulated substantial differences in gene order between the two species, for 
example human chromosome 19q13 and a segment of the homologous murine 
chromosome 7 which exhibit nine separate conserved linkage groups (DeBry and 
Seldin, 1996). The human and mouse X chromosomes contain a minimum of 
eight conserved syntenic groups (Blair et al., 1994). The possible events which 
may have led to the current arrangement of homologous segments on the human 
and murine X chromosomes are shown in Figure 2.2 and discussed further in 
Section 2.3.4. 

Syntenic regions need not, however, necessarily be comparable in size. Thus, the 
human T-cell receptor β (TCRB; 7q35) region (800 kb in length) is considerably 
larger than its murine counterpart (500 kb) as a result of repeated duplications of 
specific Vβ segments in the primate lineage (Hood et al., 1993). Similarly, the 
human HLA class II region (6p21.3) at ~900 kb is approximately three times the 
length of its mouse equivalent as a result of the duplication of specific HLA-DP, 
-DQ and -DR members of this multigene family in the primate lineage (Amadou et 
al., 1995; Hanson and Trowsdale, 1991). By contrast, the class III HLA regions of 
humans and mice are remarkably similar in structure (Peelman et al., 1996). 

Many syntenic blocks extend across human centromeres (Moseley and 
Seldin, 1989). Thus, gene order in the pericentric regions of human chromo- 
somes 1, 2, 4, 6, 11, and 19 is conserved in the homologous regions of murine 
chromosomes 3, 6, 5, 1, 2, and 8, respectively (DeBry and Seldin, 1996). The 
majority of rearrangements appear to be due to inversions within chromosomes 
rather than rearrangements between chromosomes but more detailed mapping 
data will be required to obtain the map resolution necessary for a definitive 
assessment. One caveat which should be borne in mind is that the linkage map 
may have been broken up to a larger extent in rodents than in primates as com- 
pared to the ancestral mammalian genome (Lundin, 1993). One way to study 
this type of rearrangement is by the comparative analysis of the chromosomal 
locations of the individual gene loci involved. For example, Maresco et al. (1998) 
determined the locations of the high-affinity immunoglobulin receptor genes 
(FCGR1A, 1q21; FCGR1B, 1p12; FCGR1C, 1q21) in the rhesus monkey 
(Macaca mulatta), baboon (Papio papio) and chimpanzee (Pan troglodytes) thereby 
providing evidence for the occurrence of two pericentric inversions during the 
evolution of human chromosome 1. 
Figure 2.2. Possible events that may have led to the current arrangement of homologous 
blocks on the human and mouse X chromosomes (from Blair et al., 1994). On the left, a 
postulated ancestral X chromosome is shown. Three inversion events are indicated. Each 
homologous block is shaded in the same manner throughout and the proximal and distal 
loci that define the blocks are given. Dotted lines between loci indicate the positions of 
evolutionary breakpoints. The comparative maps of the human and mouse X 
chromosomes are shown alongside their cytogenetic maps. 


Conservation of synteny may extend beyond the mammals. For example, 11/18 
genes from the chicken Z chromosome have orthologues on human chromosome 
9pter-q22, albeit in a different order (Nanda et al., 1999). What are the reasons for 
this degree of conservation? One reason may have been functional, for example in 
order to allow coordinate regulation of the genes involved (see Chapter 8, section 
8.5) or in order to avoid possible meiotic disturbance consequent to a major chro- 
mosomal rearrangement. In the case of the HLA system, synteny may have indi- 
rectly promoted the generation of diversity by optimizing the potential for gene 
conversion. Alternatively, it is possible that insufficient time has passed for ances- 
tral linkages to have been broken up completely. Synteny may be conserved in one 
phylogenetic group but not in another. Thus, the surfeit genes are tightly clustered 
in human (SURF1, SURF2, SURF3, SURF4, SURF5; 9q34.1) and chicken 
but not in invertebrates where the Surf genes are unlinked (Duhig et al., 1998). 

Assuming a human-rodent divergence time of 80 Myrs, Collins and Jukes 
(1994) estimated the rate of silent substitution to be 2.9 x 10⁻⁹ site⁻¹ year⁻¹. 
However, this is very much an average figure arrived at by comparison of the 4-fold 
degenerate sites in the protein-coding sequences of 337 human/rodent gene 
comparisons. Sequence conservation does vary quite considerably between pro- 
teins: indeed a study of 1196 orthologous mouse and human protein sequences 
revealed sequence conservation of between 36% and 100% with an average of 85% 
(Makalowski et al., 1996). 
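
The silent substitution rate quoted above relates an observed divergence at silent sites to the time since two lineages separated. As a hedged worked example, the sketch below assumes the conventional relation r = K / (2T), where K is the number of substitutions per site separating the two species and the factor of two reflects change along both lineages; whether Collins and Jukes (1994) applied exactly this form is an assumption made here. The quoted rate is read as 2.9 x 10^-9 per site per year, and the function names are hypothetical.

```python
def substitution_rate(divergence_per_site, divergence_time_years):
    """Rate per site per year, assuming change accumulates along both lineages."""
    return divergence_per_site / (2.0 * divergence_time_years)


def implied_divergence(rate_per_site_per_year, divergence_time_years):
    """Substitutions per site expected between two species after a given time."""
    return 2.0 * rate_per_site_per_year * divergence_time_years


if __name__ == "__main__":
    rate = 2.9e-9          # per site per year, as quoted in the text
    time_years = 80e6      # human-rodent divergence time assumed in the text
    k = implied_divergence(rate, time_years)
    print(f"implied divergence at silent sites: {k:.3f} substitutions per site")
```

Under these assumptions the quoted rate and an 80 Myr divergence time imply roughly 0.46 substitutions per 4-fold degenerate site between human and rodent. Adopting the alternative divergence time of 100 Myrs for the same observed divergence would lower the rate proportionately.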

Large scale genomic DNA sequence comparisons between human and mouse 
are still difficult owing to the paucity of orthologous pairs of sequences >20 kb in 
length. However, Koop (1995) noted three distinct patterns of sequence diver- 
gence in the noncoding DNA sequences of human and rodent genomes: 


(i) A high level of sequence similarity in gene regions contrasting with divergent 
non-coding regions, for example β-globin (HBB; 11p15.5) and γ-crystallin 
(CRYGA; 2q33-q35) genes. 

(ii) A conserved pattern of sequence similarity in noncoding regions, for example 
T-cell receptor-α (TCRA; 14q11.2) and -δ (TCRD; 14q11.2) genes and α- and 
β-myosin heavy chain (MYH6, MYH7; 14q11.2-q13) genes (Koop and Hood, 
1994). At least some of this conservation may be attributed to regions which 
bind T-cell nuclear proteins which may play a role in the control of gene transcription 
(Koop et al., 1994). 

(iii) A mixed pattern of sequence similarity, for example the immunoglobulin 
heavy chain J-Cμ-Cδ gene region (14q32.33): the J-Cμ portion exhibits ~64% 
sequence homology between human and mouse whilst the Cδ region shows 
little if any sequence conservation. 


This ‘mosaic model’ of genome evolution may reflect differing rates of mutation 
or differential repair efficiencies between different regions of the genome. As the 
sequences of further syntenic regions become available [e.g. the Bruton’s tyrosine 
kinase (BTK; Xq21.33-q22) gene region; Oeltjen et al., 1997], this question can be 
addressed. 


2.3 Primate evolution 


I believe that our Heavenly Father invented man because he was disappointed 
in the monkey. 
Mark Twain 


2.3.1 Adaptation and adaptive radiation 


The order of primates originated in the Palaeocene some 50-60 Myrs ago and 
contains the monkeys, apes and humans (Martin 1993; Figure 2.3). Originally 
adapted for arboreal life, extant primates include low canopy runners (guenons), 
high canopy acrobats (spider monkeys) and the brachiating great apes whilst 
some have become exclusively terrestrial (baboons, mandrills, and humans). 
These habits are responsible for the adaptation of the skeleto-muscular system to 
allow jumping, swinging and grasping. Primates exhibit a wide variety of char- 
acteristics which have equipped them to exploit their various niches optimally. 
The Anthropoidea, with their relatively large brain size, possess binocular vision 
and high visual acuity but a relatively poor olfactory sense. Their high intelli- 
gence allows rapid and measured reaction to external stimuli. The extension of 
the time periods for gestation, developmental growth and parental care are asso- 
ciated with increased powers of learning whilst communication by use of elabo- 
rate vocal signals has allowed the development of complex patterns of social 
behavior. The menstrual cycle with ovulation has been accompanied by the 
development of sexual behavior and signaling in the female. Finally, the omniv- 
orous diet of primates may have been responsible for the development of their 
characteristic manual dexterity which has allowed them both to explore and 
manipulate their environment. 

The order Primates contains about 200 living species which are divided into two 
suborders: the Prosimii and the Anthropoidea (Table 2.4). The Prosimii, which 
originated in the Palaeocene (Figure 2.3), have retained the insectivoran character- 
istics of long face, lateral eyes and small brain. The Anthropoidea comprise the Old 
World monkeys, the New World monkeys and the great apes. The New World or 
platyrrhine (flat nosed) monkeys are thought to have been isolated since the Eocene 
(Figure 2.3). The Old World or catarrhine monkeys share a common ancestor in the 
late Eocene (Figure 2.3) and do not differ markedly in either habits or organization 
from the New World monkeys. The great apes include the gibbon and orangutan 
from East Asia and the chimpanzee and gorilla from Africa. 


2.3.2 Primate phylogeny 


The fact that we are able to classify organisms at all in accordance with the 
structural characteristics which they present is due to the fact of their being 
related by descent. 

D.W. Thompson (1917) 


The phylogeny of the hominoid primates was initially investigated by means of 
DNA-DNA hybridization (Sibley and Ahlquist, 1984, 1987). These studies 
employed single copy nuclear DNA to calculate the temperature (T50H) in degrees 
Celsius at which 50% of all single copy DNA sequences were in the hybrid form 
and 50% had dissociated (Figure 2.4). The delta T50H between chimpanzee and 
human is 1.6. Assuming that a delta T50H of 1 corresponds to 1% base 
mismatches, this translates into ~3.2 x 10^7 mismatches between the chimpanzee 
and human genomes. Sibley and Ahlquist (1987) estimated times of divergence for 
higher primates as: Old World monkeys, 25-34 Myrs ago; gibbons, 16.4-23 Myrs ago; 
orangutan, 12.2-17 Myrs ago; gorilla, 7.7-11 Myrs ago; chimpanzees-humans, 
5.5-7.7 Myrs ago. It is now recognized that the chimpanzee is actually represented 
by two distinct species, the common chimpanzee (Pan troglodytes) and the pigmy 
chimpanzee (Pan paniscus) which diverged from each other ~2.3 Myrs ago. 
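
The conversion from a delta T50H value to a mismatch count is simple arithmetic, and a short sketch may make the dependence on the assumed genome size explicit. The function name and the two comparison sizes below are illustrative assumptions, not values taken from Sibley and Ahlquist; only the delta T50H of 1.6 and the 1% per degree relationship come from the text.

```python
def estimated_mismatches(delta_t50h, compared_bp, percent_per_degree=1.0):
    """Convert a delta T50H value into an estimated number of mismatched bases.

    delta_t50h: difference in median melting temperature (degrees Celsius)
    compared_bp: number of base pairs assumed to be compared (an assumption)
    percent_per_degree: percent base mismatch per degree of delta T50H
    """
    mismatch_fraction = delta_t50h * percent_per_degree / 100.0
    return mismatch_fraction * compared_bp


# Both comparison sizes are hypothetical; the result scales linearly with them.
for compared_bp in (2.0e9, 3.0e9):
    n = estimated_mismatches(1.6, compared_bp)
    print(f"{compared_bp:.1e} bp compared: {n:.1e} mismatches")
```

The estimate is therefore only as good as the assumed size of the single copy fraction being compared.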

The studies of Sibley and Ahlquist (1984, 1987) received support from Caccone 
and Powell (1989). However, the validity of conclusions drawn from DNA-DNA 
hybridization data has been challenged and the interpretation of these studies is 
still somewhat contentious (Marks et al., 1988; Sarich et al., 1989; Sibley et al., 
1990). Sibley and Ahlquist’s scheme is nevertheless in broad agreement with the 
fossil record, comparative morphology, immunological studies (Gingerich, 1984) 
and chromosome phylogeny (Yunis and Prakash, 1982; see Section 2.3.3) as well 
as being compatible with studies of individual DNA sequences. Thus, a similar 
picture has emerged from maximum parsimony analysis of DNA sequences 
derived from primate β-globin gene regions (Goodman et al., 1989, 1990; 
Hasegawa et al., 1987; Koop et al., 1989; Maeda et al., 1988; Miyamoto et al., 1987, 
1988), the c-myc oncogene (Mohammad-Ali et al., 1995), the ribosomal DNA 
genes (Gonzalez et al., 1990), the α-1,3-galactosyltransferase gene (Galili and 
Swanson, 1991), the cytochrome P450 CYP21 gene (Kawaguchi et al., 1992) and 
mitochondrial DNA (Bailey et al., 1991; Brown et al., 1982; Hixson and Brown, 
1986; Horai et al., 1992; Perrin-Pecontal et al., 1992; Ruvolo et al., 1991; Saitou, 
1991; Saitou and Nei, 1986) as well as from transversion rates in nuclear and 
mitochondrial DNA (Holmquist et al., 1988). 


Figure 2.3. Longevity of primate groups (redrawn from Carroll, 1988). 


Table 2.4. A simplified classification of the primates (derived from Young, 1973 and 
Fleagle, 1988) 


Order: Primates 
  Suborder: Prosimii 
    Infraorder: Lemuriformes 
      Superfamily: Lemuroidea 
        Family: Omomyidae 
        Family: Adapidae (extinct) 
        Family: Lemuridae (lemurs) 
        Family: Indriidae (indris) 
        Family: Daubentoniidae [aye-aye (Daubentonia)] 
        Family: Lepilemuridae (Lepilemur) 
      Superfamily: Lorisoidea 
        Family: Cheirogaleidae 
        Family: Galagidae [bushbaby (Galago)] 
        Family: Lorisidae (Loris, potto) 
    Infraorder: Tarsiiformes 
        Family: Tarsiidae (tarsier) 

  Suborder: Anthropoidea 
    Infraorder: Platyrrhini (New World monkeys) 
      Superfamily: Ceboidea 
        Family: Callitrichidae (marmosets) 
        Family: Cebidae [capuchins (Cebus), owl monkeys (Aotus)] 
        Family: Atelidae [spider monkeys (Ateles), howler monkeys (Alouatta)] 
    Infraorder: Catarrhini (Old World monkeys) 
      Superfamily: Cercopithecoidea 
        Family: Parapithecidae (extinct) 
        Family: Cercopithecidae 
          Subfamily: Cercopithecinae [macaques (Macaca), baboon (Papio), guenon 
            (Cercopithecus), mandrill, langurs (Presbytis)] 
          Subfamily: Colobinae [colobus monkeys (Colobus)] 
      Superfamily: Hominoidea 
        Family: Hylobatidae [gibbon (Hylobates)] 
        Family: Pongidae [Apes; orangutan (Pongo pygmaeus), gorilla (Gorilla gorilla), 
          chimpanzee (Pan troglodytes), pigmy chimpanzee (Pan paniscus)] 
        Family: Hominidae [Human (Homo)] 


Readers requiring a much more detailed classification of the primates are referred to Groves (1997) or to 
the Smithsonian Institution Website at http://nmnhwww.si.edu/cgi-bin/wdb/msw/children/query/11002. 

Figure 2.4. Phylogeny of the hominoid primates as determined by average linkage 
clustering of delta T50H values derived from DNA-DNA hybridization (redrawn from 
Sibley and Ahlquist, 1987). 


However, divergence data from primate protamine P1 (Retief et al., 1993) and 
α-fetoprotein (Nishio et al., 1995) gene sequences place gorilla closer to human 
than to chimpanzee. These gene sequences are known to have evolved extremely 
rapidly and the observed differences may well have been stochastic changes with 
no biological consequences for the species concerned or implications for their 
phylogeny. Similar explanations probably also apply to other examples of 
apparent gorilla-human closeness, for example studies of the mitochondrial genome 
(Ruvolo et al., 1991; Horai et al., 1992) and a polymorphism at the HLA-DQA 
locus (Gyllensten and Erlich, 1989). Studies of immunoglobulin ε pseudogenes 
have been equivocal (Ueda et al., 1985, 1989). Taken together, it is clear that the 
split between humans, chimps and gorillas was very close and that it is not 
unreasonable to expect that different sequences will have diverged to variable 
extents in the different lines. Hence the importance of the recent data of Kumar 
and Hedges (1998) discussed in Section 2.2. These data yielded slightly revised 
divergence times for higher primates with tighter error margins: gibbons, 
14.6 ± 2.8 Myrs ago; orangutan, 8.2 ± 0.8 Myrs ago; gorilla, 6.7 ± 1.3 Myrs ago; 
chimpanzees-humans, 5.5 ± 0.2 Myrs ago. 
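
The two sets of divergence estimates quoted in this section can be compared with a small sketch. It assumes that the Kumar and Hedges (1998) error margins are read as symmetric plus-or-minus intervals and that the Sibley and Ahlquist (1987) figures are read as plain ranges; the dictionary and function names are illustrative only.

```python
# Values in Myr ago, as quoted in the text.
SIBLEY_AHLQUIST = {
    "gibbons": (16.4, 23.0),
    "orangutan": (12.2, 17.0),
    "gorilla": (7.7, 11.0),
    "chimpanzee-human": (5.5, 7.7),
}
KUMAR_HEDGES = {  # (estimate, error margin), read here as estimate +/- margin
    "gibbons": (14.6, 2.8),
    "orangutan": (8.2, 0.8),
    "gorilla": (6.7, 1.3),
    "chimpanzee-human": (5.5, 0.2),
}


def interval(estimate, margin):
    """Turn an estimate and its margin into a (low, high) interval."""
    return (estimate - margin, estimate + margin)


def overlaps(a, b):
    """True if two closed intervals share at least one point."""
    return max(a[0], b[0]) <= min(a[1], b[1])


def compare():
    """Report, per split, whether the two sets of estimates overlap."""
    return {
        name: overlaps(SIBLEY_AHLQUIST[name], interval(*KUMAR_HEDGES[name]))
        for name in SIBLEY_AHLQUIST
    }


for name, agrees in compare().items():
    print(name, "overlap" if agrees else "no overlap")
```

Under this reading, only the orangutan estimates fail to overlap; a different treatment of the error margins could change that outcome.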

The rates of accumulation of mutations appear to vary by as much as seven-fold 
between different primate lineages (Koop et al., 1989). The line of descent from 
the primate node to humans shows a slowdown in evolutionary rates from 7.7 x 
10^-9 fixed changes site^-1 year^-1 for the first 15 Myrs (55-40 Myrs ago) to 
1.3 x 10^-9 for the next 15 Myrs (40-25 Myrs ago) to 1.0 x 10^-9 for the last 
25 Myrs (Koop et al., 1989). The average evolutionary rate for the hominoids 
(1.1 x 10^-9) is lower than the rates for macaque, a catarrhine (1.9 x 10^-9) and 
for spider monkey, a platyrrhine (1.8 x 10^-9). By comparison, the line of descent 
from primate node to Tarsius shows an evolutionary rate of 3.4 x 10^-9 fixed 
changes site^-1 year^-1 which is approximately half the stem-simian rate (Koop et 
al., 1989). The hominoid slowdown is at its greatest in human (Li and Tanimura, 
1987) although anatomically, humans are quite divergent. Clearly, changes in 
certain key genes must have assumed a critical importance. As if perhaps to 
emphasize this point, some gene sequences appear to buck the trend; thus, the 
evolutionary rate of the noncoding region of the immunoglobulin-α gene is greater 
in hominoids than in Old World monkeys (Kawamura et al., 1991). 
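
To see what the slowdown means in cumulative terms, the three interval rates quoted from Koop et al. (1989) can be summed along the line from the primate node to humans. The sketch below assumes that all rates are multiples of 10^-9 fixed changes per site per year, and the function names are illustrative.

```python
# (duration in Myr, rate in fixed changes per site per year)
# Rates are taken as multiples of 1e-9, an assumption about the printed exponents.
HUMAN_LINEAGE = [(15, 7.7e-9), (15, 1.3e-9), (25, 1.0e-9)]


def changes_per_site(intervals):
    """Expected fixed changes per site accumulated over all intervals."""
    return sum(myr * 1e6 * rate for myr, rate in intervals)


def mean_rate(intervals):
    """Time-weighted average rate over the whole lineage."""
    total_years = sum(myr for myr, _ in intervals) * 1e6
    return changes_per_site(intervals) / total_years


print(f"changes per site over 55 Myr: {changes_per_site(HUMAN_LINEAGE):.3f}")
print(f"time-weighted mean rate: {mean_rate(HUMAN_LINEAGE):.2e} per site per year")
print(f"fold slowdown, first to last interval: {7.7e-9 / 1.0e-9:.1f}")
```

On these figures most of the accumulated change falls in the first 15 Myrs, which is the sense in which the text speaks of a slowdown.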


2.3.3 Chromosome evolution in primates 


Old World primates 


In dim outline, evolution is evident enough but that particular and essential 
bit of the theory of evolution which is concerned with the origin and nature of 
species remains utterly mysterious. 

W. Bateson William Bateson, Naturalist (1928) 


The evolution and probable phylogeny of primate chromosomes has been exten- 
sively reviewed by both Rumpler and Dutrillaux (1990) and Clemente et al. 
(1990). The interested reader is referred to these reviews for detailed accounts. 
Some chromosomes appear to have been relatively protected from change during 
primate evolution, for example human chromosomes 19 and X. By contrast, other 
chromosomes have been prone to significant reorganization, for example human 
chromosomes 1, 3, and 7 (Figure 2.5). 

The most frequent types of chromosomal change detected in primate evolution 
are inversions (especially pericentric, see Haaf and Bray-Ward, 1996), changes in 
the amount and localization of heterochromatin, fusions and fissions, and 
changes in the location of centromeres due to activation/inactivation. Reciprocal 
translocations, deletions and insertions are much less frequent. Human chromo- 
some 18 differs from the homologous chromosomes in the great apes by a peri- 
centric inversion and it is thought that one inversion breakpoint may have been 
located at or within the centromere (McConkey, 1997). Pericentric inversion 
breakpoints have also been identified on the chimpanzee equivalents of human 
chromosomes 4 (4p14, 4q21), 9 (9q22), and 12 (12p12 and 12q15) and these appear 
to coincide with the locations of either fragile sites or tumor-associated break- 
points (Nickerson and Nelson, 1998). Pericentric inversions may have played an 
important role in establishing reproductive isolation and speciation during the 
evolution of the higher primates. 

There are at least a dozen blocks of X-Y sequence homology outwith the 
pseudoautosomal region of humans but these blocks occur in a very different order 
and orientation on these chromosomes (Vogt et al., 1997). This may be accounted 
for in terms of the occurrence of a number of different inversions, transpositions 
and other rearrangements during primate evolution (Bickmore and Cooke, 1987; 
Lambson et al., 1992; Mumm et al., 1997; Page et al., 1984; Yen et al., 1988). One 
example of a human-specific inversion is that involving the short arm of the Y 
chromosome (Schwartz et al., 1998). Since humans from different racial groups 
(Caucasian, African, and Asian) all possess this Yp inversion, the rearrangement 
must have occurred prior to the divergence of the human racial groups. 

Figure 2.5. Chromosome evolution in primates. Evolution of chromosomes 1, 2, 3, and 7. 
The main reorganizations that took place are indicated in the phylogenetic trees (redrawn 
from Clemente et al., 1990). 


Studies of chromosome banding patterns and hybridization homologies 
between ape and human chromosomes have provided evidence for human 
chromosome 2 having arisen from the fusion of two ancestral simian chromosomes. 
IJdo et al. (1991) showed that this probably occurred by telomere-telomere fusion 
at 2q13 rather than by translocation after chromosome breakage. This fusion has 
subsequently been confirmed by chromosome painting (Arnold et al., 1995; 
Wienberg et al., 1994) and since it accounts for the reduction in chromosome 
number from 24 pairs in the great apes (chimpanzee, orangutan, and gorilla) to 23 
pairs in humans, it must have been a relatively recent event. Fusion must have 
been accompanied or followed by inactivation or removal of one of the ancestral 
centromeres. Consistent with this postulate, IJdo et al. (1991) found evidence by 
hybridization for the residual presence of an ancestral centromere at 2q21. 
Clearly, this reduction in chromosome number would have been a critical event 
during the speciation process; if it was not in itself responsible for bringing about 
reproductive isolation, it would certainly have helped to maintain it. 

Another major chromosomal rearrangement to have occurred in the great apes 
is to be found in the gorilla. Using human chromosome-specific libraries as 
probes for in situ hybridization, Stanyon et al. (1992) described a reciprocal 
translocation in the gorilla lineage which is not present in the chimpanzee. Thus, 
chromosomes 4 and 19 of the gorilla were derived from a reciprocal translocation 
between the ancestral chromosomes homologous to human chromosomes 5 and 
17 (Figure 2.6). Wienberg et al. (1990) used chromosomal in situ suppression 
hybridization to demonstrate that the centromere of human chromosome 17, the 
long arm of the chromosome and a small part of the short arm all contribute to 
gorilla chromosome 4 whilst most of the short arm of chromosome 17 contributes 
to gorilla chromosome 19. 

Probably the best understood human chromosome in terms of its evolution is 
chromosome 21 (Richard and Dutrillaux, 1998; Figure 2.7). The equivalent of 
human chromosome 21 (HSA21) formed a large and unique chromosome 
together with chromosome 3 (HSA3) in the eutherian ancestor. This chromosome 
was conserved without significant alterations only in lemurs, the civet and the 
pig. It underwent inversions in the tree shrew and the cow. Various translocations 
involving the portion corresponding to HSA3 occurred in the brown lemur, cat, 
rabbit and mouse. In the primates, two independent fissions occurred. In New 
World monkeys, a small segment of HSA3 remained attached to HSA21 and this 
chromosome then underwent further rearrangements: an inversion in the mar- 
moset, the addition of heterochromatin in the capuchin monkey and a transloca- 
tion in the saki monkey. HSA21 was formed in the common ancestor of Old 
World monkeys and underwent translocations with various equivalents of human 
chromosomes in all the Cercopithecidae. HSA21 was conserved without visible 
alteration in the black gibbon and the great apes. 

Figure 2.7. Evolution of the equivalent of human chromosome 21 (HSA21) in eutherian 
mammals (after Richard and Dutrillaux, 1998). HSA21 (open rectangle) formed a large 
and unique chromosome with HSA3 (dark rectangle) in the eutherian ancestor. This 
chromosome was conserved without significant alterations only in Microcebus murinus 
(lesser mouse lemur) and Cheirogaleus major (greater dwarf lemur), Paradoxurus 
hermaphroditus (common palm civet) and Sus scrofa (pig). It underwent inversions (open 
circles) in Tupaia glis (tree shrew) and Bos taurus (cow) or various translocations involving 
the portion corresponding to HSA3 (dark squares) in Eulemur fulvus (brown lemur), Felis 
catus (cat), Oryctolagus cuniculus (rabbit), and Mus musculus (mouse). In the primates, two 
independent fissions (arrowed) occurred. In the Platyrrhini (New World monkeys), a 
small segment of HSA3 remained attached to HSA21 and this chromosome then 
underwent further rearrangements: an inversion in Callithrix jacchus (marmoset), the 
addition of heterochromatin in Cebus capucinus (capuchin monkey) and translocation in 
Pithecia pithecia (saki). HSA21 was formed in the common ancestor of all Catarrhini (Old 
World monkeys) and underwent translocations (open squares) with various equivalents of 
human chromosomes (grey rectangles) in all Cercopithecidae viz. Macaca sylvana 
(Barbary ape), Cercopithecus mona (Mona monkey), Colobus abyssinicus (northern black and 
white colobus), Symphalangus syndactylus (siamang) and Hylobates lar (white-handed 
gibbon). HSA21 was conserved without visible alteration in Nomascus concolor (black 
gibbon) and in the great apes (Gorilla gorilla, Pan troglodytes and Pongo pygmaeus). The 
number of the carrier chromosome of each species is indicated. 


One technique which is proving extremely useful in primate cytological studies 
is cross-species chromosome painting (CSCP; also known as comparative 
painting or ZOO-FISH). CSCP involves the hybridization of a chromosome-specific 
paint from one species (usually human) onto metaphase spreads of another 
species. This approach allows the rapid construction of chromosome maps from 
the primate species in question which can reveal cytogenetic homologies with the 
human karyotype (Wienberg and Stanyon, 1997). By these means, a high degree 
of synteny has been found between human and baboon chromosomes (Rogers et 
al., 1995). By contrast, numerous translocations are apparent in the gibbon 
(Hylobates lar, 2n = 44) genome as compared to human and the great apes (Jauch 
et al., 1992; Arnold et al., 1996); the 22 human autosomes have to be divided into 
51 elements in order to recombine them into the 21 gibbon autosomes. Similarly, 
in the Concolor gibbon (Hylobates concolor, 2n = 52), the 22 human autosomes 
have to be divided into 63-67 segments in order to recombine them into the 25 
gibbon autosomes (Koehler et al., 1995). 
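
The gibbon figures quoted above lend themselves to simple bookkeeping. The sketch below derives the number of autosome pairs from the diploid number and the number of cuts needed to divide the human autosomes into the stated number of segments. It is arithmetic only, not a model of the rearrangement events themselves, and the function names are illustrative.

```python
def autosome_pairs(diploid_number, sex_chromosomes=2):
    """Number of autosome pairs for a given diploid number (2n)."""
    return (diploid_number - sex_chromosomes) // 2


def cuts_required(segments, source_chromosomes):
    """Cuts needed to divide intact chromosomes into the given number of segments."""
    return segments - source_chromosomes


HUMAN_AUTOSOMES = autosome_pairs(46)

print("Hylobates lar autosome pairs:", autosome_pairs(44))
print("Concolor gibbon autosome pairs:", autosome_pairs(52))
print("cuts for 51 segments:", cuts_required(51, HUMAN_AUTOSOMES))
print("cuts for 63-67 segments:",
      cuts_required(63, HUMAN_AUTOSOMES), "to", cuts_required(67, HUMAN_AUTOSOMES))
```

The derived pair counts match the 21 and 25 gibbon autosomes given in the text.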

Despite the degree of genetic similarity between the great apes, the chim- 
panzee genome is approximately 10% larger than that of the human or gorilla 
(Pellicciari et al., 1982). Nevertheless, the structure of the human and chim- 
panzee genomes is highly conserved at both chromosomal and sub-chromosomal 
levels (Jauch et al., 1992; Ried et al., 1993). Even at lower resolution, the extent of 
evolutionary conservation is readily apparent. Human microsatellite DNA 
sequences are sufficiently well conserved in chimpanzees that human PCR 
primers can be used to amplify (CA)n repeats in the chimpanzee (Blanquer-
Maumont and Crouau-Roy, 1995; Deka et al., 1994; Garza et al., 1995). 
Differences in microsatellite allele length between humans and other primates 
have however been noted (Ellegren et al., 1995; Garza and Freimer, 1996; 
Rubinsztein et al., 1995). Crouau-Roy et al. (1996) studied microsatellites within a 
30 cM region of human chromosome 4p and found that all informative loci 
which are linked in human were also linked in the chimpanzee, indicating that 
evolutionary conservation extends to the locus level. In general, heterozygosity 
was found to be greater in chimpanzees, a reflection perhaps of the greater 
genetic diversity in chimpanzee populations. Some loci, however, appeared to be 
less heterozygous than in human, a phenomenon that appears to be caused by 
interruptions of the repeat elements at these loci. 

Human chromosomes are not always identical. Indeed many exhibit hetero- 
morphism, especially in the centromeric and satellite regions of the acrocentric 
chromosomes. Chromosomes 1, 9, 13, 14, 15, 16, 19, 21, 22, and Y are the most het- 
eromorphic whilst chromosomes 2-8 and X are the least heteromorphic (Park et 
al., 1998; Samonte et al., 1996; Trask et al., 1989a). Inter-chromosomal variation 
can be substantial; two homologues of chromosome 21 have been noted to vary 
in healthy individuals by as much as 21 Mb or 40% of the length of the 
chromosome (Trask et al., 1989a). Family studies have shown that such heteromorphisms 
are not artefactual and can be inherited in mendelian fashion (Trask et al., 1989b). 
It would appear that this variation can be largely ascribed to variation in the size 
of repeat sequence arrays and probably results from unequal crossing over 
between different classes of repetitive element. Bivariate flow karyotyping has 
been used to study the relative DNA content of homologous chromosome pairs in 
individuals from different racial groups (Mefford et al., 1997). Significant varia- 
tion in DNA content, ranging from 10 to 40%, was found for chromosomes 1, 13, 
14, 15, 16, 19, 21, 22, and Y. However, the spectrum of variation observed in the 
different racial groups was very similar. 


New World primates 


Like gibbons, many New World primates possess highly 
rearranged genomes. All New World primates have a translocation between chro- 
mosomes 8 and 18. The Cebidae possess translocations between chromosomes 
10/16 and 2/16 (Richard et al., 1996) whilst the Atelidae are characterized by 
translocations between 3/15 and 4/15. Significant synteny is nevertheless apparent 
between human and the capuchin monkey, Cebus (Richard et al., 1996), between 
human and the Colobus monkey (Bigoni et al., 1997), between human and the 
howler monkey (Consigliere et al., 1996), between human and the black-handed 
spider monkey, Ateles geoffroyi (Morescalchi et al., 1997) and between human and 
the marmoset, Callithrix (Sherlock et al., 1996). 


2.3.4 Evolution of the human sex chromosomes and the pseudoautosomal 
regions 


The human sex chromosomes are heteromorphic; the X chromosome contains 
~160 Mb DNA and perhaps 3000 genes whereas the Y chromosome contains only 
60 Mb DNA and probably only a handful of genes. The Y chromosome is largely 
composed of constitutive heterochromatin harboring different families of repeti- 
tive DNA (Wolf et al., 1992). Despite the size difference between the sex chromo- 
somes, they are nevertheless able to pair successfully during meiosis. 
Recombination, however, is largely confined to the two pseudoautosomal regions 
(PARs); a major PAR (2.6 Mb in size) at the tips of the short arms of the X and Y 
chromosomes which is the site of an obligate crossover during male meiosis 
(Simmler et al., 1995), and a minor PAR (320 kb) at the tips of the long arms of the 
X and Y chromosomes (Kvaloy et al., 1994). Regular X-Y recombinational 
exchanges have served to maintain homology between the chromosomes in the 
PARs (Graves et al., 1998a). 
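
The consequence of an obligate crossover in so small a region can be illustrated numerically. The sketch below uses the sizes quoted above and the standard convention that one crossover per meiosis corresponds to a genetic length of 50 cM; treating the major PAR as having exactly one crossover per male meiosis is a simplifying assumption, and the function names are illustrative.

```python
X_MB = 160.0        # approximate X chromosome size quoted in the text
Y_MB = 60.0         # approximate Y chromosome size quoted in the text
MAJOR_PAR_MB = 2.6
MINOR_PAR_MB = 0.32


def fraction_of_chromosome(region_mb, chromosome_mb):
    """Physical share of a chromosome occupied by a region."""
    return region_mb / chromosome_mb


def genetic_length_cm(crossovers_per_meiosis):
    """One crossover per meiosis corresponds to 50 cM of genetic map length."""
    return 50.0 * crossovers_per_meiosis


def recombination_density(crossovers_per_meiosis, region_mb):
    """Genetic length per unit physical length, in cM per Mb."""
    return genetic_length_cm(crossovers_per_meiosis) / region_mb


print(f"major PAR as share of X: {fraction_of_chromosome(MAJOR_PAR_MB, X_MB):.2%}")
print(f"major PAR as share of Y: {fraction_of_chromosome(MAJOR_PAR_MB, Y_MB):.2%}")
print(f"male cM per Mb in major PAR: {recombination_density(1, MAJOR_PAR_MB):.1f}")
```

A region amounting to under 2% of the X chromosome thus carries a full 50 cM of male genetic map under this assumption.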

Gene mapping studies have shown that part of the eutherian (placental) 
mammalian X chromosome (‘conserved region’; XCR) is shared by the X chro- 
mosomes of marsupials and monotremes (reviewed in Wilcox et al., 1996). 
Since a series of genes on the short arm of the human X are clustered in two 
autosomal groups in marsupials and monotremes, these loci define a region 
(XRA) that has been recently added to the X chromosome in eutherian mam- 
mals. Thus, the X chromosome of the common mammalian ancestor was 
smaller than that found in extant eutherians and at least two autosomal regions 
have since been added to it. The location of this X-autosome fusion corre- 
sponds to the border between the XCR and the XRA and lies at Xp11.23 
(Wilcox et al., 1996; Figure 2.8). 

The human X and Y chromosomes also exhibit substantial homology outwith 
the pseudoautosomal region (Bardoni et al., 1992; Lambson et al., 1992; Vogt et al., 
1997). Although this homology is consistent with these chromosomes having 
once constituted a homomorphic pair (Graves et al., 1998b), the observed homolo- 
gies also reflect intra-chromosomal duplication events followed by inter-chromo- 
somal translocations. For example, the minor PAR is thought to have originated 
during human evolution by the translocation of 320 kb of X chromosomal 
sequence to the Y chromosome, an event which may have been mediated by 
recombination between LINE elements (Kvaloy et al., 1994). 


Figure 2.8. (a) Evolutionary origins of the human X chromosome. XCR, conserved region 
of the X. XRA, recently added region. (b) Details of original gene order at the fusion 
point. (c) Possible inversion event that placed the XCR genes, UBE1 and ARAF1, 
between the XRA genes, SYN1 and MAOA (redrawn from Wilcox et al., 1996). 


The sex chromosomes however vary dramatically in terms of their evolu- 
tionary conservation. The Y chromosome has undergone significant changes 
during mammalian evolution (Glaser et al., 1998). By contrast, the X chromo- 
some exhibits conservation of synteny between human and mouse (Ohno’s 
Law; Ohno, 1967). This is considered to be a consequence of X inactivation 
because X-autosome translocations would have tended to be disadvantageous 
by virtue of their interference with the dosage compensation mechanism. 
Despite this, gene order on the X chromosome has changed significantly 
between mouse and human with numerous inversions altering the relative 
position of genes (Bardoni et al., 1991). 

The human Y chromosome contains a number of active genes (Lahn and Page 
1997; Vogt et al., 1997). Most notably it hosts the rapidly evolving sex determin- 
ing gene, SRY (Yp11.3; Pamilo and O’Neil 1997; Whitfield et al., 1993) which 
occurs in the sex chromosome-specific region, about 5 kb from its boundary with 
the major PAR. In human, a few X-linked genes have functional homologues on 
the Y chromosome (e.g. CSF2RA (Xp22.32, Yp11.3), MIC2 (Xp22.33, Yp11.3), 
RPS4X (Xq13.1) and RPS4Y (Yp11.3), ZFX (Xp21.3-p22.1) and ZFY (Yp11.3), 
AMELX (Xp22.1-p22.31) and AMELY (Yp11.2)). Some Y chromosome homologues 
are however nonfunctional, for example the KAL1 pseudogene on Yq11.2 
(Incerti et al., 1992) or the XG pseudogene on Yq11.21 (Weller et al., 1995). One 
consequence of the sequence identity of the PARs between the sex chromosomes 
is that the X chromosomal homologues escape X inactivation in females thereby 
ensuring that gene dosage is maintained at the level found in males. 

Genes identified in the nonrecombining portion of the Y chromosome (NRY) 
appear to fall into two distinct classes (Lahn and Page, 1997): 

(i) Those that are specifically or predominantly expressed in the testis [sex 
determining gene (SRY; Yp11.3), deleted in azoospermia (DAZ; Yq11), 
RNA-binding motif protein 1 (RBM1; Yq11), testis-specific protein (TSPY), 
chromodomain Y (CDY), basic proteins Y1 and Y2 (BPY1, BPY2), XK 
related Y (XKRY), tyrosine phosphatase PTP-BL related Y (PRY) and testis 
transcripts Y1 and Y2 (TTY1, TTY2)]. That these genes have not only been 
retained on the Y chromosome, but in two cases have also been amplified, 
may have been part of an evolutionary strategy to optimize male reproductive 
fitness. 

(ii) Those that are widely or ubiquitously expressed and which have closely 
related counterparts on the X chromosome [dead box Y (DBY), thymosin β4 
(TB4Y), translation initiation factor 1A (EIF1AY), ubiquitous TPR motif Y 
(UTY) and Drosophila fat facets-related (DFFRY; Yq11.2), AMELY (Yp11.2), 
RPS4Y (Yp11.3), zinc finger protein Y (ZFY; Yp11.3) and SMCY; Lahn and 
Page, 1997]. Conservation of specific X-Y gene pairs may have been 
associated with a requirement to maintain comparable expression levels for certain 
housekeeping genes between males and females. Consistent with the 
predictions of this postulate, the X chromosome homologues of these Y-borne genes 
escape X inactivation. 
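
The two classes listed above can be held in a small lookup structure, which may be convenient when annotating gene lists. The sketch below is illustrative: the class labels and function name are assumptions, and the gene symbols follow the list above with BPY1, TTY1 and EIF1AY read as ending in the digit 1.

```python
# Class labels are shorthand introduced here, not terms from Lahn and Page (1997).
NRY_CLASSES = {
    "testis-specific": [
        "SRY", "DAZ", "RBM1", "TSPY", "CDY", "BPY1", "BPY2",
        "XKRY", "PRY", "TTY1", "TTY2",
    ],
    "ubiquitous, X-homologous": [
        "DBY", "TB4Y", "EIF1AY", "UTY", "DFFRY", "AMELY",
        "RPS4Y", "ZFY", "SMCY",
    ],
}

# Invert the mapping so that each gene symbol points to its class.
GENE_TO_CLASS = {
    gene: label for label, genes in NRY_CLASSES.items() for gene in genes
}


def classify_nry_gene(symbol):
    """Return the class label for an NRY gene symbol, or None if unlisted."""
    return GENE_TO_CLASS.get(symbol.upper())


for symbol in ("SRY", "ZFY", "SOX3"):
    print(symbol, "->", classify_nry_gene(symbol))
```

Genes absent from the list, such as the X-linked SOX3, return None rather than a guessed class.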


The mammalian sex chromosomes are thought to be descended from a homolo- 
gous pair of autosomes (reviewed by Ellis 1996; Graves et al., 1998a, 1998b; Wolf 
et al., 1992). This process could have been initiated with the evolutionary appear- 
ance of the testis-determining SRY gene on the nascent Y chromosome, probably 
by duplication, translocation and subsequent divergence of the X-linked SOX3 
(Xq26-q27) gene. Suppression of recombination with its homologous chromo- 
some (the nascent X) then led to the gradual degeneration of the Y chromosome 
owing to its inability to segregate genes carrying deleterious alleles (‘Muller’s 
ratchet’; Charlesworth, 1978). Evidence for this degenerative process comes from 
several sources. The rate of nucleotide substitution in Y chromosome genes 
appears to be ~2-fold higher than the rate exhibited by X chromosomal genes 
(Pamilo and Bianchi, 1993; Shimmin et al., 1993) although the frequency of DNA 
sequence polymorphism may be lower in the sex-specific region of the Y chromo- 
some than in the PARs (Allen and Oster, 1994; Whitfield et al., 1995). The Y chro- 
mosome also exhibits a high frequency of retroviral insertion in humans, 
chimpanzees and orangutans (Kjellman et al., 1995). Finally, there is emerging 
evidence for gene loss from the Y chromosome during mammalian evolution. In 
mouse and human, the ubiquitin-activating enzyme (UBE1) gene is located on 
the X chromosome (Xp11.23-p11.3). A copy of the Ube1 gene is also located on the 
Y chromosome in the mouse, ring-tailed lemur, squirrel monkey (Saimiri sciureus) 
and marmoset (Callithrix jacchus) but not in the Old World monkeys, chimpanzee 
or human, indicating loss of the Y-linked gene >35 Myrs ago during the evolution 
of the primates (Mitchell et al., 1998). Similarly, the Y-linked copy of the EIF2S3 
gene (Xp22.1-p22.2) encoding the eukaryotic translation initiation factor EIF-2γ 
was lost 35-60 Myrs ago in a common ancestor of the simian primates (Ehrmann 
et al., 1998). The gradual degeneration of the Y chromosome implies that the 
retention of functional gene copies on this chromosome for a significant period of 
evolutionary time could have conferred some selective advantage. 

It should however be noted that the Y chromosome can also acquire genetic 
material from other chromosomes. For example, the multicopy DAZLI1 gene 
(Yq11.23; deleted in azoospermia) was transposed to the Y chromosome from an 
autosome during primate evolution (Glaser et al., 1998; Saxena et al., 1996; Shan 
et al., 1996) as was the multicopy RNA-binding motif (RBM1; Yq11) gene (Chai 
et al., 1998; Delbridge et al., 1997). It may be that transfer to a male-specific loca- 
tion provided protection against inactivation or loss. Another example of the 
duplicational transposition of a gene to the Y chromosome is that of AMELX 
(Xp22.1-p22.31) and its Y-chromosome counterpart AMELY (Yp11.2); the latter 
gene, which appears to be fully functional, is present on the Y chromosomes of 
bovids and primates but not rodents thereby dating the transpositional event to at 
least 40 Myrs ago (Toyosawa et al., 1998). 

In humans, the XG blood group gene (Xp22.32) spans the boundary of the major 
PAR on the X chromosome: the first three exons are pseudoautosomal whereas the remaining 
seven are X chromosome-specific (Weller et al., 1995). In humans and the great 
apes, an Alu sequence is located at the boundary between the major PAR and the 
Y chromosome-specific DNA (Ellis et al., 1990) but this sequence is not present 
in Old World monkeys. The Alu sequence was therefore inserted into the pre- 
existing boundary after the divergence of the great apes from the Old World 
monkeys. Although it did not create the boundary, the Alu sequence does serve 
to demarcate it. 

Ellis et al. (1994) proposed a model for the formation of the boundary of the 
major PAR. They hypothesized a pericentric inversion of the Y chromosome with 
one breakpoint in the ancestral XG gene and the other breakpoint 5 kb distal to 
the ancestral SRY gene. In a refinement of this postulate, Fukagawa et al. (1996) 
suggested that the inversion occurred by illegitimate recombination between two 
PAR boundary sequences, one in the ancestral XG gene and the other near the 
ancestral SRY gene. 

The PAR has undergone quite rapid change during mammalian evolution 
involving both gene duplication and translocation events in the region (e.g. STS, 
MIC2, XG, CSF2RA, IL3RA, ARSD, ARSE; Meroni et al., 1996; Ried et al., 
1998) and resulting in the movement of the PAR boundary to create X-unique 
regions (Perry et al., 1998). The evolution of the PAR and the divergence of the 
mammalian X and Y chromosomes may be viewed in terms of the ‘addition-attri- 
tion’ hypothesis (Graves, 1995; Graves et al., 1998a). This states that the incorpo- 
ration of autosomal sequences into the PAR of either the X or Y chromosome 
initially served to generate homologous regions which could pair at meiosis. 
Recombination with an homologous partner could then result in PAR enlarge- 
ment. Alternatively, the steadily accumulating mutations on the Y chromosome 
would have served to decrease the level of homology to the X chromosome 
thereby reducing PAR size. Fukagawa et al. (1996) proposed a further twist to this 
argument in that once divergence had reached a certain level, recombination fre- 
quency would have decreased thereby further increasing the rate of divergence. 

Evidence in favor of the addition-attrition theory comes from the dynamic 
nature of the major PAR region during mammalian evolution. Thus, the STS 
gene which is X-linked in humans and the great apes (Xp22.32) is autosomal in 
prosimians as is the ANT3 gene which is pseudoautosomal (Yp11.3) in humans 
(Toder et al., 1995). Similarly, the human pseudoautosomal genes CSF2RA 
(Xp22.32/Yp11.3), SHOX (Xp22/Yp11.3) and IL3RA (Xp22.3/Yp11.3), are auto- 
somal in the mouse (Ellis, 1996). Further evidence for the process of attrition may 
come from the finding that whereas the Fxy gene spans the PAR on the murine X 
chromosome, its human counterpart (FXY; Xp22.3) lies proximal to the human 
PAR (Perry et al., 1998). 

The ‘X-driven’ hypothesis of Graves (1995; 1998) essentially proposes that the 
rapid evolutionary spreading of X inactivation preceded the decay of Y chromo- 
somal genes and even drove its initial steps. This hypothesis predicts that inacti- 
vated X-linked genes with functionally comparable Y-linked homologues should 
exist as evolutionary intermediates, but as yet no such gene has been found in any 
mammalian species. Indeed, a considerable number of human X-linked genes 
escape X-inactivation but have no detectable Y-borne counterpart. 

An alternative ‘Y-driven’ pathway of X-Y gene evolution has been proposed by 
Jegalian and Page (1998). Briefly, these authors suggested that many extant genes 
represent intermediates on a general pathway by which X-Y genes or gene clus- 
ters evolved from autosomal genes (Figure 2.9). Autosomal genes would have 
entered the pathway either by virtue of their presence on the emergent sex chro- 
mosomes or via translocation of an autosomal gene. This would have been fol- 
lowed by suppression of X-Y recombination. These steps occurred either at the 
chromosomal or sub-chromosomal level and gave rise to functionally equivalent 
X-linked (but not inactivated) genes and Y-linked genes. Subsequently, three dif- 
ferent processes (Y gene decay, upregulation of X-linked gene expression, and X- 
inactivation) interacted resulting in an inactivated X-linked gene accompanied by 
the loss of the Y gene. Expression of the X-linked gene then increased as an adap- 
tation to the reduced or restricted expression of its Y-linked counterpart. This 
compensated for the loss of Y gene function and restored optimal expression lev- 
els in males. X-inactivation, on the other hand, may be viewed as a counter- 
response which restored optimal expression levels in females. This explanation 
could in principle account for most X-linked and X-Y homologous genes in extant 
mammals, many of which exist at intermediate steps in the pathway. Jegalian and 
Page (1998) pointed out that only one gene cannot be accommodated in their 
pathway schema: the human pseudoautosomal gene, SYBL1, which is X-inacti- 
vated and transcriptionally silenced on the Y chromosome. A model for the evo- 
lution of the mammalian sex chromosomes summarizing the above processes is 
presented in Figure 2.10. 


2.3.5 Evolution of the mitochondrial genome 


The 16 569 bp of the human mitochondrial genome encodes 13 polypeptides, all 
subunits of the enzyme complexes of the pathway of oxidative phosphorylation, 
and a total of 22 tRNAs. The mitochondrial genome is characterized by its high 
proportion of coding DNA, the paucity of repetitive DNA sequence, the absence 
of introns within its genes and its own genetic code distinct from that of the 
nuclear genome (Kurland, 1992). Mitochondrial genes experience a mutation rate 
that has been estimated to be up to 17 times higher than the corresponding rate 
for nuclear genes (Wallace et al., 1987). This is thought to be due to the fact that
the mitochondrial genome undergoes many more rounds of replication but may
also be a consequence of the relative error proneness of the mitochondrial DNA
polymerase γ and increased exposure to the potentially mutagenic products of
oxidative metabolism. Whatever the explanation, one consequence of the higher
mutation rate has been that the mitochondrially encoded subunits of the
oxidative phosphorylation enzyme complexes have evolved at a much higher rate than
their nuclear encoded counterparts (Shoffner and Wallace, 1990).

Figure 2.9. A proposed Y-driven pathway for X-Y gene evolution in mammals (after
Jegalian and Page, 1998).

The origin of mitochondria is thought to have been through endocytosis by an 
anaerobic eukaryote of an aerobic eubacterium possessing an oxidative phosphoryla- 
tion system (Gray, 1992). During the evolutionary transformation of endosymbiont 
to organelle, there has been significant transfer of DNA sequences from the mito- 
chondrial to the nuclear genome (Hu and Thilly, 1994; Sadlock et al., 1993; Sorenson 
and Fleischer, 1996). Interestingly, the mitochondrial genome of the slime mould 
Dictyostelium has retained a gene encoding a NADH: ubiquinone oxidoreductase 
subunit which was transferred to the nuclear genome in the common ancestor of 
other eukaryotes (Cole et al., 1995). Some mitochondrial DNA sequences present in 
the human genome as pseudogenes have been incorporated only relatively recently 
in primate evolution (Collura and Stewart 1995; Wallace et al., 1997; Zischler et al., 
1995). It has been suggested that coevolution may have occurred between nuclear
and mitochondrial genes, for example cytochrome c oxidase subunit IV (COX4;
16q22-qter and mt genome; Lomax et al., 1992).

Figure 2.10. Model for the evolution of the mammalian sex chromosomes.

Mitochondrial DNA variation has been very informative for the study of the 
evolution of human populations (see Section 2.3.6). The interested reader is 
referred to Stoneking (1996) for a review of the topic. 


2.3.6 The evolution of human populations 


Darwinian Man, though well behaved, at best is only a monkey shaved! 
W.S.Gilbert Princess Ida (1884) 


Discussion of the fossil record of early hominids is outwith the remit of this volume 
and the interested reader is referred to Wood (1992, 1996) for readable précis.
Similarly, the origins of modern humans and the movement of human populations 
have been well reviewed by a number of authors including Cavalli-Sforza et al. 
(1994), Cavalli-Sforza (1998), von Haeseler et al. (1995), Jones, Martin, and Pilbeam 
(1992) and Lewin (1998). 

There are currently two different views of the origins of modern humans. The 
first, which is not inconsistent with the fossil record, proposes that different pop- 
ulations (‘races’) of Homo sapiens evolved independently from their ancestor Homo 
erectus in different parts of the Old World (‘multiregional model’). Migration of H. 
erectus from Africa to the rest of the Old World may have occurred ~1 Myrs ago. 
The alternative hypothesis is that H. sapiens arose once in Africa (‘Out of Africa 
model’) and that the species may have been subjected at some stage to a severe 
population bottleneck. With both models, the geographical separation of popula- 
tions has led to the emergence of morphological differences although continued 
gene flow between populations has served to ensure that human gene diversity 
remains graded rather than discrete. 

One of the most dramatic demonstrations so far of the utility of molecular 
genetics to the study of human gene evolution has been the analysis of mitochon- 
drial DNA (mtDNA) from a fossil Neanderthal specimen (Krings et al., 1997). 
Neanderthals and humans are considered to have shared a common ancestor 
between 550 000 and 690 000 years ago. Since the Neanderthal mtDNA sequence 
was found to be quite distinct from those of modern humans, it would appear as 
if Neanderthals became extinct without contributing mtDNA to human popula- 
tions, a finding consistent with the Out of Africa hypothesis. 

The question of the age of the human gene pool has been approached by study- 
ing variation associated either with the mitochondrial genome or with Y chromo- 
some-derived DNA sequences (mtDNA exhibits a high rate of evolutionary 
change and, along with most of the Y chromosome, is transmitted without recom- 
bination). Studies of mtDNA have yielded dates of the order of 200 000 years ago 
for the origin of modern humans (Hey, 1997; Horai et al., 1995; Ruvolo et al., 
1993; Vigilant et al., 1991), broadly consistent with estimates derived from Y- 
chromosome-derived DNA sequence data (Agulnik et al., 1998; Brookfield, 1995; 
Dorit et al., 1995; Fu and Li, 1996; Hammer, 1995; Jobling, 1996; Mitchell and 
Hammer, 1996; Weiss and von Haeseler, 1996). Use of intronic variation in the 
ZFX (Xp21.3-p22.2) gene yielded a figure of 306 000 years (Huang et al., 1998) 
that is almost certainly discrepant since the 95% confidence interval was
extremely broad. Using 30 microsatellite loci to construct a phylogenetic tree for 
14 different human populations, Goldstein et al. (1995) estimated that the time 
since divergence of African and non-African populations was 156 000 years (95%
CI, 75 000-287 000 years). The above estimates are broadly compatible with those
derived from polymorphisms associated with the CD4 (12p12-pter) gene
(Tishkoff et al., 1996), microsatellite data (Nei and Takezaki, 1996) and protein 
polymorphism data (Nei 1995; Nei and Takezaki, 1996). It must be remembered, 
however, that these results simply reflect the time elapsed since the most recent 
common ancestor for the sample population rather than the most recent common 
ancestor of all humans. Small populations and/or population bottlenecks 
(Ambrose, 1998) will have served to obscure the actual timing of the ‘origin’ of 
modern humans. 
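
The logic behind such dating estimates can be illustrated with a minimal sketch. It assumes a strict molecular clock, under which two sequences that last shared an ancestor t years ago differ on average by d = 2μt substitutions per site, since changes accumulate along both lineages. The function name and the numerical values below are illustrative assumptions and are not taken from any of the studies cited above.

```python
def tmrca_years(pairwise_divergence, rate_per_site_per_year):
    """Estimate time to the most recent common ancestor of two sequences.

    Assumes a strict molecular clock: d = 2 * mu * t, hence t = d / (2 * mu).
    """
    if rate_per_site_per_year <= 0:
        raise ValueError("substitution rate must be positive")
    if pairwise_divergence < 0:
        raise ValueError("divergence cannot be negative")
    return pairwise_divergence / (2.0 * rate_per_site_per_year)


# Hypothetical inputs, chosen only to illustrate the arithmetic.
d = 0.004   # substitutions per site separating the two sequences
mu = 1e-8   # substitutions per site per year along each lineage

print(round(tmrca_years(d, mu)))  # 200000
```

Any date obtained in this way is only as good as the assumed rate, and it refers to the common ancestor of the sequences sampled rather than to the origin of the species.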

As to the place of origin of modern humans, Wainscoat et al. (1986) claimed that 
the relative frequencies of haplotypes of the β-globin (HBB; 11p15.5) gene in
African and non-African populations provided evidence for a migration out of
Africa by a fairly small population. Further evidence for a recent African origin for 
modern humans comes from the observation that African populations have a ~20% 
greater microsatellite sequence diversity as compared with Asian and European 
populations (Armour et al., 1996; Bowcock et al., 1994; Jorde et al., 1997; 1998; Nei
1995; Perez-Lezaun et al., 1997; Tishkoff et al., 1996). Studies of mitochondrial 
DNA have also shown that there is greater genetic diversity between African popu- 
lations than among Asian or European populations (Comas et al., 1997; 
Merriwether et al., 1991). Indeed, these authors showed that genetic variation
among humans on all continents is a subset of the variation present in Africans.
However, some polymorphism lineages do not show deep branches for African pop- 
ulations which has made the Out of Africa hypothesis somewhat contentious (Jorde 
et al., 1995; Harding et al., 1997). The higher level of genetic diversity manifested by 
African populations may simply be a reflection of their greater population size over 
the last million years (Relethford and Harpending, 1995). 

Y chromosome variants appear to be more highly clustered geographically than 
those of mtDNA (Cavalli-Sforza and Minch 1997; Ruiz-Linares et al., 1996; 
Underhill et al., 1997). One explanation for this difference could be that male 
migration has been more limited than that of women (Seielstad et al., 1998). 

About 84% of human genetic diversity exists as differences between individuals 
within populations but the remaining 16% can be used to distinguish between 
populations (Barbujani et al., 1997). By comparison with apes, the extent of the 
genetic variation exhibited by modern humans is relatively low (Ferris et al., 
1981; Jorde et al., 1998; Li and Sadler 1991). This lack of genetic diversity is likely 
to be a reflection of long-term small population size [before the introduction of 
agriculture 10000 years ago, the entire human population probably did not 
exceed 100 000 and is thought to have been around 10 000 for most of its history 
(Erlich et al., 1996; Harpending et al., 1998; Takahata, 1993; Zietkiewicz et al., 
1998)], the effects of past population bottlenecks and the explosive population 
growth particularly during the last 10 000 years (Ambrose, 1998; Cavalli-Sforza et 
al., 1993; Harding et al., 1997; Knight et al., 1996; Reich and Goldstein, 1998; 
Xiong et al., 1991). 
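
The apportionment of diversity into within-population and between-population components can be made concrete with a simple single-locus calculation. The sketch below assumes a biallelic locus and equally sized populations, and uses the expected heterozygosity 2p(1-p). The allele frequencies are hypothetical, chosen only so that the between-population share comes out at 16%, and are not data from Barbujani et al. (1997).

```python
def heterozygosity(p):
    """Expected heterozygosity at a biallelic locus with allele frequency p."""
    return 2.0 * p * (1.0 - p)


def between_population_share(freqs):
    """Fraction of total diversity attributable to differences between
    populations: 1 - H_within / H_total (equal population sizes assumed)."""
    if not freqs:
        raise ValueError("at least one population is required")
    h_within = sum(heterozygosity(p) for p in freqs) / len(freqs)
    p_mean = sum(freqs) / len(freqs)
    h_total = heterozygosity(p_mean)
    if h_total == 0.0:
        raise ValueError("locus is monomorphic in the pooled sample")
    return 1.0 - h_within / h_total


# Two hypothetical populations with allele frequencies 0.3 and 0.7.
share = between_population_share([0.3, 0.7])
print(f"between populations: {share:.2f}")      # 0.16
print(f"within populations:  {1 - share:.2f}")  # 0.84
```

Even with allele frequencies as different as 0.3 and 0.7, most of the diversity at the locus is still found within each population.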

In human history, demographic expansions have often occurred as a result of
technological developments affecting food availability and transportation fuelled 
by the pursuit of military or economic objectives. In terms of establishing genetic 
differences between populations, linguistic barriers have also been important 
(Cavalli-Sforza 1997). 


2.3.7 The action of natural selection in human populations 


From the time when the ancestral man first walked erect, with hands freed 
from any active part in locomotion, and when his brain-power became suffi- 
cient to cause him to use his hands in making weapons and tools, houses and 
clothing, to use fire for cooking, and to plant seeds or roots to supply himself 
with stores of food, the power of natural selection would cease to act in pro- 
ducing modifications of his body, but would continuously advance his mind 
through the development of its organ, the brain. 

Alfred Russel Wallace Darwinism (1889) 


Evidence for the recent effects of natural selection on human populations comes 
indirectly from observed variation in allele frequencies (Cavalli-Sforza et al., 
1994) with infectious disease often serving as the selecting agent (Hill 1996a; 
1996b; Levin et al., 1999). The classical example of pathogen-driven selection is 
the heterozygote advantage accruing to carriers of the Glu6→Val sickle cell
mutation in the β-globin (HBB) gene which confers resistance to falciparum malaria
(Vogel and Motulsky, 1997). It has been suggested that the high frequency of cer- 
tain diseases in specific populations may be explained in similar ways (Motulsky 
1995; Zlotogora 1994). 
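
How heterozygote advantage maintains a deleterious allele at appreciable frequency can be shown with the standard one-locus selection model. The sketch below assumes relative fitnesses of 1 - s, 1 and 1 - t for the normal homozygote, the heterozygote and the mutant homozygote respectively, for which the mutant allele settles at the equilibrium frequency s/(s + t). The selection coefficients used are arbitrary illustrative values and are not estimates for the sickle cell allele.

```python
def next_frequency(q, s, t):
    """One generation of selection on the mutant allele frequency q.

    Fitnesses: normal homozygote 1 - s, heterozygote 1, mutant homozygote 1 - t.
    """
    p = 1.0 - q
    mean_fitness = p * p * (1.0 - s) + 2.0 * p * q + q * q * (1.0 - t)
    return q * (p + q * (1.0 - t)) / mean_fitness


def equilibrium_frequency(s, t):
    """Stable equilibrium frequency of the mutant allele under overdominance."""
    return s / (s + t)


def iterate(q0, s, t, generations):
    """Follow the mutant allele frequency for a number of generations."""
    q = q0
    for _ in range(generations):
        q = next_frequency(q, s, t)
    return q


# Hypothetical coefficients: modest cost to normal homozygotes (s),
# severe cost to mutant homozygotes (t).
s, t = 0.1, 0.8
print(f"predicted equilibrium: {equilibrium_frequency(s, t):.3f}")
print(f"after 500 generations: {iterate(0.01, s, t, 500):.3f}")
```

Starting from a rare allele, the simulated frequency converges on the predicted equilibrium, which is below one half whenever the mutant homozygote is the less fit of the two homozygotes.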

Heterozygote advantage has been invoked to account for the spread of the common
cystic fibrosis mutation (ΔF508) in the CFTR (7q31.3) gene in Caucasian
populations. The basis for heterozygote advantage was proposed to be increased
fitness of heterozygous carriers during cholera epidemics (Gabriel et al., 1994).
However, it is difficult to see how such overdominant selection could have
brought about the extremely high prevalence of one specific CFTR lesion (ΔF508)
relative to the large number of alternative CFTR mutations known. It would also
be difficult to explain the gradient in ΔF508 frequency across Europe unless there
was also a gradient of selective pressure. Thus, the most parsimonious explana- 
tion is probably genetic drift. 

A large body of data has accumulated on HLA polymorphism and its relation- 
ship to disease susceptibility, resistance and progression (Bodmer, 1996; Hall and 
Bowness, 1996). Polymorphic alleles at both HLA class I and II loci have been 
shown to be under selection (Begovich et al., 1992; Hill et al., 1991; Hughes 1988, 
1989). One recent example is selection for specific HLA class II (HLA-DR and 
HLA-DQB; 6p21.3) alleles as a result of hepatitis B virus infection (Thursz et al.,
1997). More than 90% of the 135 known HLA-DRB1 alleles appear to have been 
generated since the divergence of human and chimpanzee (Bergström et al., 
1998); such changes appear to have arisen both by point mutation and by gene 
conversion and are consistent with the existence of substantial selective pressure. 

Pathogen-driven selection may also have been responsible for increasing the fre- 
quency of genetic variants at other loci, for example the CCR2 (3p21; Smith et al., 
1997) and CCR5 (3p21; Dean et al., 1996; Libert et al., 1998) chemokine receptor
genes and the stromal cell-derived factor 1 (SDF1; 10q11.2; Winkler et al., 1998) gene.
With the advent of the HIV epidemic, these variants may become advantageous in 
that they appear able to restrict HIV-1 infection and decrease the progression of 
HIV-1 infection to AIDS. Genetic susceptibility to parasitic infections may be 
under oligogenic control (e.g. Leishmania, Shaw et al., 1995; Schistosoma mansoni, 
Marquet et al., 1996; Mycobacterium tuberculosis, Bellamy et al., 1998; 1999; Shaw et 
al., 1997; Plasmodium falciparum, Rihet et al., 1998) implying that a number of
different variants at different loci may be subject to selection. Infectious and parasitic
disease is currently estimated to kill up to 20 million people in the world every year 
with acute respiratory infection, tuberculosis, diarrhoea, malaria, measles, hepati- 
tis B, whooping cough and tetanus responsible for 3/4 of this toll. Clearly, it is 
likely that selection will continue to operate on existing genetic variation at a wide 
range of genetic loci that serve to determine both the host’s susceptibility and 
resistance to the disease in question. 

Polymorphism of the human ABO blood group system (ABO; 9q34) may owe 
its origins to balancing (overdominant) selection mediated by infectious agents 
(Eder and Spitalnik, 1997). Intriguingly, the same substitutions that differenti- 
ate the A and B alleles in humans are also present in the great apes and Old 
World monkeys (see Chapter 1, section 1.2.2) leading to speculation that these 
polymorphic antigens may have arisen early in primate evolution (Kominato et 
al., 1992; Matinko et al., 1993). However, intronic sequence data point instead to 
an origin for the human alleles about 3 Myrs ago (O’Huigen et al., 1997) which 
argue for a model of convergent evolution of the blood group antigens in the 
higher primates. 

Whilst pathogen-driven selection may be an important factor in increasing 
the frequency of certain genetic variants at specific human loci, susceptibility to 
infectious disease is unlikely to be the sole means by which natural selection 
influences allele frequencies. One common genetic variant not thought to be 
associated with infectious disease has been found in factor VII. Plasma levels of 
factor VII vary significantly in the general population (Howard et al., 1994) and 
are known to be influenced by a number of different environmental factors 
including sex, age, cholesterol, and triglyceride levels (Scarabin et al., 1996). An 
Arg/Gln polymorphism at residue 353 of factor VII (F7; 13q34) which occurs 
with a frequency of about 10% in various populations, is associated with a 
20-25% reduction in the level of plasma factor VII activity as a result of the 
impaired secretion of the Gln variant (Cooper et al., 1997). This high frequency 
is suggestive of a balanced polymorphism and could indicate that the Gln vari- 
ant confers some benefit, for example protection against thrombosis, myocar- 
dial infarction or arterial disease (Escoffre et al., 1995). In support of this 
postulate, Silveira et al. (1994) have shown that the Gln allele is associated with 
a reduction in the amount of activated factor VII (FVIIa) generated in response 
to fat intake; individuals with the Arg/Gln genotype were found to possess 
FVIIa levels 48% of that exhibited by individuals homozygous for the Arg
allele. Interestingly, a decanucleotide insertion polymorphism at -323 in the F7 
gene promoter has been shown to be associated with a 33% reduction in pro- 
moter activity in vitro and a lower level of plasma factor VII activity and antigen 
in vivo (Pollak et al., 1996; Humphries et al., 1996). The occurrence of two
functionally significant polymorphisms in the same gene is consistent with the idea
of a selective advantage accruing to individuals with reduced factor VII activity. 

A similar hemostatic system polymorphism is factor V Leiden. This variant, 
which underlies the phenomenon of activated protein C resistance, results from 
the substitution of Arg506 by Gln in coagulation factor V (F5; 1q23). Factor Va 
serves as a cofactor in the activation of prothrombin by factor Xa and the factor V 
Leiden variant is relatively resistant to activated protein C-mediated inactivation. 
Between 1% and 7% of the Caucasian population possess the factor V Leiden 
mutation (Cooper and Krawczak, 1997) which may therefore be regarded as a 
fairly frequent polymorphism with phenotypic effect. Since the factor V Leiden 
mutation is also associated with a relative risk of ~6.0 for venous thrombosis, this 
is also a polymorphism with clinical effect. Why is this factor V variant so com- 
mon? Its high frequency in the general population suggests that it confers, or has 
conferred, some selective advantage on its bearers. Dahlback (1994) speculated 
that a slight hypercoagulable state associated with possession of the factor V 
Leiden variant might have been advantageous in certain situations such as trau- 
matic injury and childbirth. Consistent with this postulate, Lindqvist et al. (1998) 
have shown that carriers of the factor V Leiden variant have a significantly 
reduced risk of bleeding during childbirth. It may be that other common poly- 
morphic variants in hemostatic factor genes e.g. the G20210A transition in the 3’ 
untranslated region of the prothrombin (F2; 11p11-q12; Zivelin et al., 1998) gene, 
are explicable by similar models. 

Other examples of polymorphic variants which may have conferred a selective 
advantage on carriers are the ‘insertion’ (I) allele of the angiotensin-converting 
enzyme (DCP1; 17q23) gene which appears to be associated with improved 
human endurance (Montgomery et al., 1998) and the Glu487/Lys polymorphism 
in the aldehyde dehydrogenase 2 (ALDH2; 12q24) gene which is associated with 
alcohol sensitivity and alcohol avoidance (Goedde et al., 1992). Selection is likely 
to have also operated on a variety of other human characteristics including cogni- 
tive ability (McClearn et al., 1997), body form, skin pigmentation (Smith, 1993) 
and pharmacogenetic variation (Kalow, 1997). 


2.4 Sequencing the genomes of model organisms and 
humans 


The characterization of the genomes of a number of different and disparate 
species should aid significantly our understanding of the human genome, its 
structure, function and evolution. Such species (‘model organisms’) include a bac- 
terium (Escherichia coli), a yeast (Saccharomyces cerevisiae), a nematode 
(Caenorhabditis elegans), the fruitfly (Drosophila melanogaster), the pufferfish (Fugu 
rubripes), the mouse, and the rat. Sequencing of the genomes of these model organ- 
isms is essential for the discovery, description and characterization of all genes 
within those genomes and the proteins that these genes encode. It will provide 
information not only on the chromosomal organization of genes and gene families 
but also on the elements that control their expression. 

The 4.64 Mb genome of E. coli has been sequenced and contains 4288 protein-coding
genes (Blattner et al., 1997). By comparison, the 12.1 Mb genome of S.
cerevisiae, which is organized into 16 chromosomes, contains 5885 protein-cod- 
ing genes, ~140 ribosomal RNA genes, 40 snRNA genes and 275 tRNA genes 
(Goffeau et al., 1996). The Saccharomyces Genome Database is available online at 
http://genome-www.stanford.edu/Saccharomyces/. 

The entire 97 Mb genome of C. elegans has also been sequenced and represents 
the first fully characterized genome of a multicellular eukaryote (C. elegans 
Sequencing Consortium, 1998). This sequence predicts a total of 19 099 protein- 
coding genes and at least several hundred further genes specifying noncoding 
RNAs. At least 36% of C. elegans proteins exhibit a match in humans whilst 74% 
of characterized human proteins exhibit a match with a C. elegans protein. Each 
C. elegans gene has an average of five introns and the exons constitute some 27% 
of the nematode genome. Sequence data are available through the C. elegans 
Genome Project Website at http://www.sanger.ac.uk/Projects/C_elegans/. 
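
The genome sizes and gene counts quoted above for E. coli, S. cerevisiae and C. elegans allow a rough comparison of gene density. The short sketch below simply divides the number of protein-coding genes by genome size in Mb, using only the figures given in the text; the function and variable names are illustrative.

```python
# (genome size in Mb, number of protein-coding genes), as quoted in the text
GENOMES = {
    "E. coli": (4.64, 4288),
    "S. cerevisiae": (12.1, 5885),
    "C. elegans": (97.0, 19099),
}


def gene_density(size_mb, n_genes):
    """Protein-coding genes per megabase of genome."""
    if size_mb <= 0:
        raise ValueError("genome size must be positive")
    return n_genes / size_mb


densities = {name: gene_density(size, genes)
             for name, (size, genes) in GENOMES.items()}

for name, density in densities.items():
    print(f"{name}: {density:.0f} protein-coding genes per Mb")
```

On these figures, gene density falls roughly fivefold from the bacterium to the nematode, even though the total number of genes rises.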

Comparison of the complete gene/protein sets of yeast and nematode has 
revealed that for a substantial proportion of the two organisms’ genes, one-to-one 
orthologous relationships are identifiable (Chervitz et al., 1998). This suggests 
that the functions of many gene products were already established in the common 
ancestor of fungi and the metazoa. By contrast, most of the C. elegans signaling 
and regulatory genes that are known or expected to be involved in multicellular- 
ity have no yeast orthologue even though they may contain domain sequences 
that are present in yeast (Chervitz et al., 1998).

The possession of the complete genome sequences of various model organisms is 
proving of enormous benefit in identifying the human homologues of genes that 
are shared between these organisms and humans. Thus, the expressed sequence 
tags (EST) database (dbEST; http://www.ncbi.nlm.nih.gov/dbEST/index.html)
can be screened using model organism genes as ‘probes’. An example of this 
approach (termed ‘cyberscreening’ or ‘in silico cloning’) is provided by the cloning 
of five human orthologues of yeast genes encoding proteins of the mitochondrial 
respiratory chain complex (Petruzzella et al., 1998). 

Within the next 5 years, the 3200 Mb human genome sequence should also 
become available (Rowen et al., 1997). This will permit integration of cytogenetic, 
genetic, physical and transcriptional maps of the genome, information on inter- 
individual polymorphic variation, and the genotype-phenotype relationship par- 
ticularly in the context of complex traits. Progress made in mapping and 
sequencing the human genome may be followed at the following websites: 
http://www.ncbi.nlm.nih.gov/genemap98/ (National Center for Biotechnology 
Information), http://www.sanger.ac.uk/HGP (Sanger Centre),
http://genome.wustl.edu/gsc/index.shtml (Washington University Genome Sequencing Center),
http://www-seq.wi.mit.edu/ (Whitehead Institute/MIT Genome Sequencing 
Project) and http://www.jgi.doe.gov/ (Joint Genome Institute). The availability of 
the human genome sequence will lead to the identification of novel genes encod- 
ing new proteins and the characterization of disease genes which should provide 
new insights into mechanisms of disease. Comparative genome mapping will also 
provide important insights into the evolution of the mammalian genome, its chro- 
mosomal architecture and its genes and gene families. 

PART 2
EVOLUTION OF GENE STRUCTURE


Introns, exons, and evolution


3.1 Intron structure, function and evolution 


Eukaryotic genes are not usually contiguous entities but are instead interrupted 
by noncoding sequences termed introns. How were introns first acquired? 
Although there is at present no clear unambiguous answer to this question, there 
appears to be a continuous evolutionary line from the archaic self-splicing group
II introns of eubacteria and simple eukaryotes (Belfort, 1993; Lambowitz and 
Belfort, 1993) via protein-assisted self-splicing introns to the splicing exhibited 
by extant eukaryotic nuclear protein-encoding genes. It is therefore reasonable to 
speculate that group II introns could have been introduced by the invasion of 
eukaryotic cells by the bacterial endosymbiotic ancestor of the mitochondrion 
(Cavalier-Smith, 1991). 

Whatever their origin, a close correlation exists between intron density and 
developmental complexity. Thus introns are largely, although not exclusively, 
confined to the eukaryotes. The archaebacteria, which form a third major taxon 
distinct from the eukaryotes and eubacteria, possess small introns in their tRNA 
and rRNA genes whilst eubacteria and a few eukaryotes possess introns that catal- 
yse their own splicing. An average of one intron per kilobase (kb) of coding 
sequence is found in simple eukaryotes such as Dictyostelium and Plasmodium, 3–4
per kb in plants and fungi and 6 per kb in vertebrates (Palmer and Logsdon, 
1991). The average cellular gene in vertebrates contains about seven introns 
(Sharp, 1994). Intron size also increases with phylogenetic complexity with 
intronic sequence accounting for only 10-20% of primary transcripts in the pro- 
tista but as much as 95% in vertebrates (Cavalier-Smith, 1985). 

As far as human genes are concerned, introns are the rule rather than the excep- 
tion (Chapter 1, section 1.2.1). Some human genes have nevertheless been found 
which lack introns. These include the sex determining (SRY; Yp11.3) gene 
(O’Neil et al., 1998), the POU domain transcription factor POU3F2 (6q16) gene 
(Atanasoski et al., 1995), the thrombomodulin (THBD; 20p11.2) gene (Jackman et 
al., 1987), the β2-adrenergic receptor (ADRB2; 5q32-q34) gene (Kobilka et al.,
1987), the JUN proto-oncogene (1p31-p32; Hattori et al., 1988), the
recombination-activating genes RAG1 and RAG2 (11p13), the arylamine
N-acetyltransferase (AAC1; 8p21.3-p23.1) gene (Grant et al., 1989), the 13 genes of the
interferon-α (IFNA) gene cluster and the interferon-β gene (IFNB1) at 9p21
(Diaz et al., 1994), the heat shock 70 kDa protein genes (HSPA2, 14q22-q24;
HSPA1A, 6p21.3; Milner and Campbell 1990), the pyruvate dehydrogenase E1α
subunit (PDHA1; Xp22) gene (Dahl et al., 1990), the formyl peptide receptor
(FPR1; chromosome 19) gene (De Nardin et al., 1992), the serotonin receptor
genes HTR1D (1p34.3-p36.3) and HTR1F (Demchyshyn et al., 1992; Adham et
al., 1993), the casein kinase 2-α1 subunit (CSNK2A1; 20p13) gene (Devilat and
Carvallo, 1993), the ID2 gene (2p25) encoding an inhibitor of DNA binding
(Kurabayashi et al., 1993), the purinergic receptor (P2RY1; chromosome 3) gene
(Ayyanathan et al., 1996) and the calmodulin-like genes (CALML3, 10p13-pter;
CALML1, 7p13-pter; Berchtold et al., 1993; Koller and Strehler, 1993).
Intriguingly, >90% of the human genes that encode the G-protein-coupled recep- 
tor family are intronless, a finding which may reflect a retrotranspositional origin 
prior to gene duplication (Gentles and Karlin, 1999). 

Introns in human genes vary enormously in size, from as little as 24 bp in the 
case of the parvalbumin (PVALB; 22q12-q13) gene, to in excess of 600 kb in the 
case of the α1(V) collagen (COL5A1; 9q34.2-q34.3) gene (Takahara et al., 1995).
Minimum intron size may be determined by the need to prevent steric hindrance 
between splicing factors. Intron size is not usually well conserved between orthol- 
ogous genes. Thus during the evolution of the higher primates, intron 8 of the 
lamin B (LMNB2; 19p13.3) gene increased very dramatically in size as a result of 
a repeat expansion (de Stanchina et al., 1997), whilst intron 6 of the Ewing sar- 
coma breakpoint region 1 (EWSR1; 22q12) gene expanded progressively through 
successive retrotransposition and recombination events involving Alu sequences 
(Zucman-Rossi et al., 1997). Similarly, the paralogous murine phospholipase D1 
and D2 genes differ in size as a result of a 20-fold expansion/contraction of intron 
size in one of the genes (Redina and Frohman, 1998). There are, however, notable 
exceptions to the rule of lack of conservation; for example, the sizes of the 53 
introns of the human and mouse α1(II) collagen (COL2A1; 12q12-q13.2) genes
differ on average by only 13% (Ala-Kokko et al., 1995). 

In the context of intron size, one enigma is the pufferfish (Fugu rubripes) whose 
400 Mb genome is 1/7 the size of the human genome, is relatively devoid of repet- 
itive DNA, and contains comparatively short introns (75% are less than 120 bp in 
length) (Brenner et al., 1993). A dramatic example of the economy manifested by 
the Fugu genome is provided by the relative size of intron 7 of the Duchenne 
muscular dystrophy (DMD; Xp21) gene in Fugu (2.4 kb) as compared to its 
human counterpart (109.6 kb) (McNaughton et al., 1997). At least 40% of the 
human intron is made up of LINE elements, Alu sequences, THE-1 and related 
LTR sequences, interspersed repeat sequences, a mariner transposon and other 
repeats including microsatellites whose insertion served to double the size of the 
intron over the last 130 Myrs (McNaughton et al., 1997). This example is not 
unrepresentative of the size differences noted between orthologous human and 
Fugu introns; thus the neurofibromatosis type 1 (NF1; 17q11) gene spans only 27 
kb in Fugu as compared to 335 kb in human (Kehrer-Sawatzki et al., 1998). 
Whether or not the size of the Fugu genome represents the ancestral state of the 
early vertebrate genome is unclear but its size may well approach the minimum 
sustainable for a vertebrate. We can only speculate as to the possible reasons why
selection has acted on a genome-wide basis to minimize intron size in this
species. Perhaps the Fugu genome has been resistant to colonization by transpos- 
able elements and so has no endogenous source of reverse transcriptase to aid the 
process of retrotransposition. 
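
The scale of the human/Fugu size differences quoted above is easier to appreciate as fold-changes. The following is a minimal sketch that uses only the figures given in the text; the dictionary layout and the function name `fold_difference` are illustrative assumptions, not part of any published analysis.

```python
# Sizes in kb, exactly as quoted in the text (McNaughton et al., 1997;
# Kehrer-Sawatzki et al., 1998). The data structure itself is illustrative.
sizes_kb = {
    "DMD intron 7": {"fugu": 2.4, "human": 109.6},
    "NF1 gene": {"fugu": 27.0, "human": 335.0},
}


def fold_difference(human_kb, fugu_kb):
    """Return how many times larger the human sequence is than the Fugu one."""
    return human_kb / fugu_kb


for name, size in sizes_kb.items():
    ratio = fold_difference(size["human"], size["fugu"])
    print(f"{name}: human is about {ratio:.1f}-fold larger than Fugu")
```

On these figures the single DMD intron differs far more between the two species than the NF1 gene taken as a whole, which illustrates how unevenly the expansion is distributed.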

Since a high proportion of human genes belong to gene families or super-fami- 
lies which have arisen through a process of duplication and divergence (Chapter 
4), it is hardly surprising that the locations of introns are often evolutionarily con- 
served. Such conservation may be evident in terms of the structures of ortholo- 
gous genes, for example the phenylalanine hydroxylase (PAH; 12q22-q24) genes 
of Drosophila and human (Ruiz-Vazquez et al., 1996) or the myoglobin gene of the 
abalone Sulculus diversicolor and the human gene encoding indoleamine 2,3-dioxy- 
genase (IDO; 8p11-p12; Suzuki et al., 1996). It may also be apparent between par- 
alogous genes, members of the same gene family within a given species, for 
example between the highly homologous human monoamine oxidase A and B 
genes (MAOA, MAOB; Xp11.23; Grimsby et al., 1991) or between the human 
gene encoding the α-subunit of the granulocyte-macrophage colony stimulating
factor receptor (CSF2RA; Xp22.32) and other members of the cytokine receptor 
family (Nakagawa et al., 1994). However, it is also clear that some evolutionarily 
related genes can possess quite divergent intron distributions, for example those 
encoding the actin-regulatory proteins, gelsolin (GSN; 9q33), the villins (VIL1,
2q35-q36; VIL2, 6q22-q27) and the capping protein Cap G (CAPG; 2cen-q24)
(Mishra et al., 1994), or the complement proteins C6 (chromosome 5), C7
(chromosome 5) and C9 (5p12-p14) (Hobart et al., 1993) or the fibrinogens (FGA, FGB,
FGG; 4q31) (Crabtree et al., 1985). A particularly dramatic example of post-dupli- 
cation structural divergence is provided by the human microfibril-associated gly- 
coprotein genes MFAP2 (1p35-p36) and MAGP2 (12p12-p13). These two genes 
comprise 9 and 10 exons respectively, but sequence and structural conservation as 
well as conservation of intron location is confined to exons 8 and 9 of the MFAP2 
gene and exons 7 and 8 of the MAGP2 gene (Hatzinikolas and Gibson, 1998). 
These exons encode the first six of the seven precisely aligned cysteine residues at 
the center of both proteins. Divergent exon distribution can arise through intron 
insertion and deletion (Section 3.5). Alternatively, it can arise by
recombination; an example of this is provided by the human prosaposin (PSAP; 10q22.1)
gene. The PSAP gene is polycistronic, encoding a precursor for four saposins (A, 
B, C, and D). Saposins A, B, and D are encoded by three exons, saposin C by only 
two. Analysis of intron locations has indicated that the PSAP gene evolved by 
two duplication events and at least one rearrangement (Rorman et al., 1992). This 
rearrangement is thought to have involved a double crossover, between the first 
and second and between the second and third intron positions of the saposin B 
and C coding regions, after the introduction of introns (Rorman et al., 1992). 

Intron sequences themselves are not usually well conserved during evolution 
(Sharp, 1994), a finding which is not surprising in view of the fact that most 
intronic sequence is likely to be nonfunctional. However, there are some notable 
exceptions: the single intron of the oligodendrocyte-myelin glycoprotein (OMG; 
17q11-q12) gene exhibits 75% overall homology between human and mouse
(Mikol et al., 1993) whilst the 53 introns of the COL2A1 (12q12-q13.2) gene differ
by 69% between the two species (Ala-Kokko et al., 1995). Sequences in the third
intron of the γ-actin (ACTG1; 17p11-qter) gene are highly conserved between
human and Xenopus (Erba et al., 1988). Perhaps the most dramatic example of the
conservation of intronic sequence is that evident in a comparison of 97 kb of the
human and murine T-cell receptor α/δ (TCRA, TCRD; 14q11.2) gene locus which
exhibits 66% overall sequence homology even though <6% corresponds to exonic 
sequence (Koop and Hood, 1994). 

There are exceptions to the general rule that intron sequences evolve more 
rapidly than exons. For instance, the second exons of the human semenogelin 
genes (SEMG1, SEMG2; 20q12-q13.1) have evolved very rapidly when compared 
to their rat homologues, and more so even than the flanking introns (Lundwall 
and Lazure, 1995). Similarly, the divergence between the red and green visual pig- 
ment (GCP, RCP; Xq28) genes in humans, chimpanzees and baboons is lower in 
intron 4 than in exons 4 and 5 of these genes (Shyue et al., 1994; Zhou and Li, 
1996). In this case, homogenization of intron 4 sequences is thought to have been 
brought about by gene conversion (see Chapter 9, section 9.5) whilst selection has 
probably acted so as to confine gene conversion to the intron in order to retain the 
distinct functions of exons 4 and 5 between the two genes. 

Evolutionary conservation may imply function but we are only just beginning 
to elucidate the function of conserved sequences within introns. The introns of 
some genes contain the coding sequences of other genes (see Chapter 1, section 
1.2.1). Thus, the OMG, EVI2A, EVI2B genes occur within intron 27b of the 
human neurofibromatosis type 1 (NF1; 17q11.2) gene and are transcribed in the 
opposite direction to the NF1 gene. An orthologue of the EVI2B gene is present
in the corresponding Fugu intron but this intron reveals no trace of the OMG or
EVI2A genes (Kehrer-Sawatzki et al., 1998). This indicates that the EVI2B gene
must have been inserted into the NF1 gene more than 450 Myrs ago whilst the
OMG gene must have been a more recent acquisition. 

Some intronic sequence motifs perform transcriptional regulatory functions. 
Positive regulatory (enhancer-like) elements have now been reported from the first 
introns of a considerable number of human genes, for example the genes encoding 
type X collagen (COL10A1; 6q21-q22.3; Beier et al., 1997), type I
3β-hydroxysteroid dehydrogenase (HSD3B2; 1p13.1; Guerin et al., 1995), tissue inhibitor of
metalloproteinase 1 (TIMP1; Xp11.23-p11.3; Clark et al., 1997), factor IX (F9;
Xq27; Kurachi et al., 1995), heat shock protein 90 (HSPCB; 6p12; Shen et al.,
1997), Bruton’s tyrosine kinase (BTK; Xq21.3-q22; Rohrer and Conley, 1998),
purine nucleoside phosphorylase (NP; 14q13.1; Jonsson et al., 1992),
O6-methylguanine DNA methyltransferase (MGMT; 10q26; Harris et al., 1994), type I
collagen (COL1A1; 17q21.3-q22; Hormuzdi et al., 1998), IgE receptor (FCER1B;
11q13; Lacy et al., 1994), growth hormone (GH1; 17q22-q24; Slater et al., 1985; 
Kolb et al., 1998), thymidylate synthase (TYMS; 18p11.32; Takayanagi et al., 1992) 
and dystrophin (DMD; Xp21.2; Klamut et al., 1996). Although enhancer 
sequences are most commonly found within the first intron of genes, such 
sequence elements are also occasionally found in other locations, for example the 
second intron of the human apolipoprotein B (APOB; 2p24) gene (Rosby et al., 
1992), the third intron of the human oxytocin receptor (OXTR; 3p26) gene 
(Mizumoto et al., 1997) and intron 8 of the δ-aminolevulinate synthase 2 (ALAS2;
Xp11.21) gene (Surinya et al., 1998). Negative regulatory elements (repressors)
within introns have also been characterized, for example in the first introns of the
human Bruton’s tyrosine kinase (BTK; Xq21.3-q22; Rohrer and Conley, 1998) and 
acid maltase (GAA; 17q25.2-q25.3; Raben et al., 1996) genes. A highly conserved 
CCAAT element within intron 1 of the proliferating cell nuclear antigen (PCNA; 
20p12) gene also appears to act as a negative regulatory element (Alder et al., 1992). 

Other sequences within introns may affect nucleosome formation (Denisov et 
al., 1997), or play a regulatory role in the processing of the primary transcript 
either by modulating mRNA splicing or by influencing mRNA stability through 
RNA-DNA, RNA-RNA or RNA-protein interactions (Mattick, 1994). For example,
sequences within the second intron of the human β-globin (HBB; 11p15.5)
gene are important in promoting efficient 3’ end formation and appear to be 
essential to the stability of the cytoplasmic HBB mRNA (Antoniou et al., 1998). A 
further fortuitous function of introns may be to act as ‘lightning conductors’ for 
the retrotranspositional insertion of mobile elements thereby protecting the 
nearby coding sequences from inactivation (Ferlini and Muntoni 1998). A 
glimpse of what surprises introns may have in store for us is to be found within 
the introns of the human U22 host gene. Seven fibrillarin-associated small nucle- 
olar RNAs (U25-U31), with complementarity to different segments of rRNA, are 
encoded within different introns of this gene and are highly conserved between 
human and mouse (Tycowski et al., 1996). The spliced U22 host gene mRNA 
species has by contrast little coding potential, is short lived and is evolutionarily 
poorly conserved. The surprise therefore is that in the U22 host gene, it is the 
‘introns’ rather than the ‘exons’ which appear to specify the functional products. 


3.2 The evolution of alternative processing 


3.2.1 Alternative splicing 


Alternative splicing allows the generation of different mRNAs (and therefore a 
diverse array of protein isoforms) from the same gene (Figure 3.1). It is therefore an 
important mechanism for the tissue-specific or developmental regulation of gene 
expression. The potential evolutionary advantages of alternative splicing have 
been discussed in detail by Smith et al. (1989) and hence will now only be summa- 
rized briefly. Unlike gene duplication or rearrangement, alternative splicing does 
not change the gene structure or copy number and need not therefore be irre- 
versible in genetic terms. Since existing splicing pathways need not necessarily be 
discarded in order to employ new ones, alternative splicing is likely to be particu- 
larly useful as a means to generate protein diversity during early development and 
in very long lived and terminally differentiated cells. The use of alternative splic- 
ing may also facilitate the efficient exploitation of intragenic duplications since if 
the transcript of the newly duplicated gene were to be alternatively spliced, the 
gene could continue to produce the old gene product as well as the new one. 
Although alternative splicing implies the existence of cell and developmental 
stage-specific splicing factors, it is still unclear whether it is a predecessor or a 
refinement of constitutive splicing. Smith et al. (1989) argued that the two 
processes could have evolved simultaneously. If, indeed, many genes have evolved
by exon shuffling (Section 3.6), then novel exon combinations may have been
tested by alternative splicing. If these conferred a selective advantage, then they 
could have been fixed by mutation at the appropriate splice junction rendering 
them constitutive. Consistent with this view, most alternatively spliced exons 
appear to have originated by exon duplication. 
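
The point that a gene can keep producing the old product alongside new ones can be made concrete with a little bookkeeping. The sketch below enumerates the transcripts obtainable from a purely hypothetical five-exon gene in which two exons (one of them an intragenic duplicate, here called `E3dup`) are optional; the exon names and the function `enumerate_isoforms` are assumptions made for illustration only.

```python
from itertools import product


def enumerate_isoforms(exons):
    """List every exon combination for a gene.

    `exons` is a list of (name, is_constitutive) pairs. Constitutive exons
    are always retained; each alternatively spliced (cassette) exon may be
    either included or skipped.
    """
    choices = [[True] if constitutive else [True, False]
               for _, constitutive in exons]
    isoforms = []
    for included in product(*choices):
        isoforms.append([name for (name, _), keep in zip(exons, included) if keep])
    return isoforms


# Hypothetical gene: E2 and the duplicated exon E3dup are alternatively spliced.
gene = [("E1", True), ("E2", False), ("E3", True), ("E3dup", False), ("E4", True)]
isoforms = enumerate_isoforms(gene)
for isoform in isoforms:
    print("-".join(isoform))
```

The transcript lacking `E3dup` corresponds to the pre-duplication product, so the gene loses nothing by acquiring the extra exon as long as that exon remains optional.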

There are now several examples of alternative splicing being evolutionarily 
conserved in orthologous genes. For instance, exon 9a of the proto-oncogene 
MTG8/ETO (CBFA2T1; 8q22) is alternatively spliced in both mouse and human 
(Wolford and Prochazka 1998). A developmental alternative splicing switch, 
involving exon 16 of the erythroid protein 4.1 (EPB41; 1p) gene, occurs during 
mammalian erythropoiesis and this alternative splice also occurs in Xenopus 
(Winardi et al., 1995). The reason for conservation of this switch over 350 Myrs of 
evolution probably lies in the fact that it controls the expression of a 21 amino 
acid peptide required for the protein’s high affinity interactions with spectrin and 
actin that help to regulate erythrocyte membrane stability. The alternative
splicing of exons 2 and 27 of the neural cell adhesion molecule L1 (NCAM1;
11q23-q24) gene has been conserved in human and the pufferfish (Fugu rubripes),
demonstrating evolutionary conservation of the alternative splicing mechanism
over some 430 Myrs (Coutelle et al., 1998). In the α-tropomyosin (TPM1; 15q22)
gene, alternative splicing to produce tissue-specific isoforms has been conserved 
from Drosophila to human, corresponding to a timespan of at least 700 Myrs 
(Wieczorek et al., 1988). Exon 6’ of the chromatin condensation regulator gene 
(CHC1; 1p36.1) is involved in alternative splicing in both humans and hamsters 
even though this exon encodes 31 amino acids in human but only 13 in hamster
(Miyabashira et al., 1994). Alternatively spliced forms of the nuclear factor I/C
(NFIC) and I/X (NFIX) genes, linked on human chromosome 19p13.3, are evolu- 
tionarily conserved from chickens to humans (Kruse and Sippel, 1994). The 
chicken CP49 gene, encoding a 49 kDa cytoskeletal protein, generates two differ- 
ent mRNA transcripts, one containing a novel exon encoding 49 amino acids of 
helix 1B, the other lacking it (Wallace et al., 1998). The transcript containing this 
exon is however absent from both the orthologous bovine and human (BFSP2; 
3q21-q25) genes. Similarly, alternative splicing of the Wilms’ tumor (WT1;
11p13) gene is evident in various mammals but not in the pufferfish, Fugu rubripes 
(Miles et al., 1998). Finally, in the fibronectin (FN1; 2q34) gene, where exon EIIIB 
is alternatively spliced in a cell-type-specific manner, the pattern of TGCATG 
repeats in the downstream intron has been evolutionarily conserved across the 
vertebrates suggesting a possible role for these sequences in splice site selection 
(Lim and Sharp, 1998). 
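
Locating a short repeated motif such as TGCATG in an intron is a simple pattern search. The following is a minimal sketch; the intron sequence is invented purely for illustration and is not the fibronectin sequence, and the function name `motif_positions` is an assumption.

```python
import re


def motif_positions(sequence, motif="TGCATG"):
    """Return the 0-based start of every motif occurrence, overlaps included."""
    # A zero-width lookahead lets the search report overlapping matches too.
    pattern = re.compile(f"(?={re.escape(motif)})")
    return [match.start() for match in pattern.finditer(sequence.upper())]


# Invented intron fragment, used only to exercise the function.
intron = "gtaagtTGCATGccattgcatgttTGCATGcag"
positions = motif_positions(intron)
print(f"{len(positions)} copies of TGCATG at positions {positions}")
```

Comparing such position lists between orthologous introns is one simple way to ask whether the pattern of repeats, and not merely their presence, has been conserved.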

Alternative processing can however also have had a relatively recent origin. 
Insertion of a B2 (SINE) element into the 3’ untranslated region of the murine 
leukemia inhibitory factor receptor (lifr) gene (human equivalent, LIFR; 5p12- 
p13) encoding the soluble form of the leukemia inhibitory factor receptor (LIFR) 
has, by potentiating alternative 3’ mRNA processing and alternative splicing, 
given rise to a truncated mRNA species (relative to the mRNA encoding the 
membrane-anchored LIFR) which encodes soluble LIFR (Michel et al., 1997). In 
the rat, no such retrotranspositional event has occurred and the soluble form of 
LIFR is not found. The potential for alternative processing must therefore have 
arisen in the last 20-30 Myrs. Two species of mRNA encoding the metabotropic 
glutamate receptor subtype 5 (GRM5) gene occur in rat and human which differ
in terms of the presence of a 96 bp insertion thought to result from alternative 
mRNA processing (Minakami et al., 1993). Another example of the relatively 
recent evolution of alternative splicing has been noted in the lecithin:cholesterol
acyltransferase (LCAT; 16q22) gene and is present in humans and the great apes 
but not in gibbons or Old World and New World monkeys (Miller and Zeller, 
1997). Known examples of single base-pair substitutions in splice sites of specific 
orthologous gene pairs that are responsible for the evolutionary emergence of 
alternative splicing are discussed in Chapter 7, section 7.5.3. 

One of the most dramatic inter-specific differences in alternative splicing 
involves the human t complex responder (TCP10; 6q27) gene and its murine
orthologue (Islam et al., 1993). The human TCP10 transcript includes two exons 
not present in mouse transcripts, whilst mouse tcp10 transcripts include up to four 
exons that are not present in the human transcript. It is at present unclear whether 
a given exon has been incorporated into the transcript of one species or lost from 
the other. The recruitment and removal of exons is potentially explicable by single 
base-pair substitutions either eliminating splice junctions or converting intronic 
DNA sequence into novel splice sites. Islam et al. (1993) described this mode of 
evolution as ‘punctuated equilibrium’, a term that is used to refer to evolutionary 
situations in which phenotypic characters appear to leap from one equilibrium 
state to another in a short space of time. These authors suggested that this mode of 
evolution might have been facilitated by the existence of a gene family in which 
individual members possessed redundant functions. One family member would
therefore have been free to pass through a nonfunctional state in which several
base changes accumulated, resulting eventually in a change in splicing phenotype. 
Whether selection or genetic drift could have been responsible was unclear. 
Unfortunately in this context, the authors did not see fit to discuss the intron 
phase of the added/removed exons. 

Evidence for intergenic differences in alternative splicing during evolution has 
also come from the study of paralogous genes. Although the intron-exon distribu- 
tion in the human annexin VI (ANX6; 5q32-q34) gene conforms exactly to that 
found in the human annexin I (ANX1; 9q11-q22) and II (ANX2; 15q21-q22)
genes, exon 21 of the annexin VI gene (encoding 6 amino acids) is alternatively 
spliced (Smith et al., 1994). Similarly, the human chloride channel gene CLCN6 
(1p36) contains an alternatively spliced 167 bp exon that is absent in the
evolutionarily related CLCN1 (7q32-qter), CLCN5 (Xp) and CLCN7 (16p13) genes
(Eggermont, 1998). The genetic basis of these examples of gene-specific alternative
splicing has however not yet been elucidated. Known examples of single base-pair 
substitutions in paralogous genes responsible for the emergence of alternative 
splicing are discussed in Chapter 7, section 7.5.3. 

Alternative processing may become redundant after gene duplication as exem- 
plified by the troponin I gene which encodes one of the three subunits of the tro- 
ponin complex of the thin filaments of vertebrate striated muscle. Invertebrates 
and ascidians possess a single troponin I gene (Hastings, 1997) and this is alter- 
natively spliced to generate proteins that differ in their N-terminal regions. By 
contrast, the three troponin I genes in the human genome (TNNI1, 1q31; TNNI2,
11p15; TNNI3, 19q13) and in the genomes of other vertebrates do not undergo
alternative splicing although the ‘extra’ exon in the TNNI3 gene (as compared
with the TNNI1 and TNNI2 genes) is thought to correspond to the exon which is
alternatively spliced in ascidians and invertebrates (Hastings, 1997). 

A database of alternatively spliced genes (ASDB) is available online at 
http://cbcg.nersc.gov/asdb and contains information about alternatively spliced 
genes in different organisms and in different tissues. 

Alternative transcripts may also be generated by the differential utilization of 
polyadenylation sites (Figure 3.1). The differential utilization of two distinct 
polyadenylation sites in the plasminogen activator inhibitor 1 (PAI1; 7q22) gene
(yielding two mRNA species of 2.6 kb and 3.6 kb) has been conserved between
humans, orangutans and African green monkeys (Cicila et al., 1989). However,
only one PAI1 mRNA species is apparent in lower primates and nonprimate
mammals (Cicila et al., 1989) consistent with the acquisition of an extra
polyadenylation site during primate evolution. The evolution of alternative
processing may have come about through the differential utilization of
polyadenylation sites via the newly described mechanism of LINE element-mediated
recombination (see Section 3.6.1). Finally, alternative transcripts may
also arise through alternative promoter usage. There is very little information on 
inter-specific differences in promoter site selection: the example of the insulin- 
like growth factor II gene is cited in Section 3.7. 

Herbert and Rich (1999) have stressed the potential importance of RNA process- 
ing in evolutionary processes. Their perspective, which deserves close examination, 
was elegantly summarized in one of their introductory paragraphs:

The outcome of one RNA processing event can affect the outcome of others
resulting in regulatory ‘networks’ that influence both the information content 
and expression of the ‘ribotype’ (RNA pool), which in turn influences pheno- 
type. Alterations in RNA processing generate different sets of ribotypes which 
are subject to natural selection on the basis of the phenotypes they produce. 
The evolutionary process requires the preservation of successful ribotypes in 
an inheritable (sic) form. 


3.2.2 Aberrant transcripts 


There are numerous examples in the literature of aberrantly processed mRNA 
transcripts being produced in addition to correctly processed mRNAs in the cells 
of normal healthy individuals (Berg et al., 1996). It is possible that some of these 
‘aberrant transcripts’ could be of functional significance but in the majority of 
cases they are likely to occur simply as a consequence of the absence of any selec- 
tive pressure to avoid aberrant splicing. One example of this phenomenon is pro- 
vided by the human glutathione peroxidase type 5 (GPX5) gene in which the 
majority of gene transcripts appear to be incorrectly spliced (Hall et al., 1998). If 
taken to its extreme, however, and the extent of aberrant splicing increases 
beyond a certain point, a gene may come to be effectively inactivated as has 
occurred in the case of the human chorionic somatomammotropin (CSHL1; 
17q22-q24) ‘pseudogene’ (Misra-Press et al., 1994). 


3.2.3 mRNA surveillance


The cell does however possess a mechanism that is capable of removing those aber- 
rant transcripts that encode ‘nonsense mRNAs’ which would otherwise be expected 
to give rise to truncated proteins. Such mRNAs are often inherently unstable as a 
result of a process known as mRNA surveillance which involves nonsense-mediated 
mRNA decay (Culbertson, 1999). mRNA surveillance appears to be ubiquitous in 
eukaryotes with homologous proteins encoded by orthologous genes being found in 
organisms as widely separated as yeast (Upf1, Upf2, Upf3) and human (RENT1;
19p13; Culbertson, 1999). Presumably this mechanism has evolved in order to 
reduce the cost to the cell of producing non-productive transcripts. 


3.2.4 Ectopic transcripts 


Extremely low levels of correctly spliced mRNA transcripts from tissue-specific 
genes have been demonstrated in supposedly ‘non-expressing’ cell types 
(reviewed by Cooper et al., 1994). Such ectopic or illegitimate transcripts have been 
found to occur at levels as low as one mRNA molecule per 500-1000 cells, > 1000- 
fold lower than the level of an average ‘low abundance’ mRNA. It remains unclear 
whether every cell is capable of generating ectopic transcripts or alternatively if 
the occasional cell in an otherwise non-expressing tissue is able to produce com- 
paratively high levels of the transcript in question. Do ectopic transcripts have a 
biological role? Since the levels involved are often extremely low, it is hard to 
imagine that such a role involves significant protein synthesis. Perhaps what we 
observe simply represents a reasonable balance between the cost to the cell of
producing nonproductive transcripts from many ‘non-expressed’ genes and the
cost of turning off the transcription of many thousands of ‘leaky’ genes com- 
pletely. 


3.3 Exon structure and evolution 


The average cellular gene in vertebrates contains about eight exons (Sharp, 1994). 
A comprehensive analysis of 5061 exons from some 2705 intron-containing human 
genes has yielded a classification scheme for exons according to their transcrip- 
tional or translational boundaries (Zhang, 1998). This scheme (depicted in Figure 
3.2) employs 12 different categories: (1) 5’ terminal untranslated exons, (2) 3’ ter- 
minal untranslated exons, (3) 5’ terminal exons having a 5’ untranslated sequence 
(UTR) followed by a coding sequence, (4) 3’ terminal exons having a 3’ untrans- 
lated sequence (UTR) followed by a coding sequence, (5) internal exons having a 
5’ portion of 5’ UTR followed by coding sequence, (6) internal exons having a cod- 
ing sequence followed by a 3’ portion of 3’ UTR, (7) internal untranslated exons, 
(8) internal translated exons, (9) exons containing the complete coding sequence 
but which do not contain the transcriptional end, (10) exons containing the com- 
plete coding sequence but which do not contain the transcriptional start, (11) 
exons containing the complete coding sequence and both the transcriptional start 
and the transcriptional end, and (12) exons containing the complete coding 
sequence but neither the transcriptional start nor the transcriptional end. 
Although no doubt subject to ascertainment bias, the sample of exons studied by 
Zhang (1998) exhibited the following frequencies by type for the first 8 categories 
of exon: (1) 271, (2) 38, (3) 482, (4) 553, (5) 174, (6) 69, (7) 34, (8) 3440. Exons in cat- 
egories (3) and (7) were in general found to be <100 bp in length whereas exons in 
categories (2) and (4) were mostly 300-500 bp in length. There appears to be virtu- 
ally no minimum size for an internal translated exon (category 8) in human genes; 
the smallest include the 4 bp exon 3 of the skeletal muscle troponin I (TNNI1;
1q31) gene and two 3 bp exons (10 and 14) in the cardiac myosin binding protein
C (MYBPC3; 11p11.2) gene. As far as the maximum size of an internal exon is
concerned, the human F8C (3106 bp; Xq28), APOB (7572 bp; 2p23-p24) and MUC5B
(10,690 bp; 11p15.5) genes are currently league leaders. 
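
The category frequencies quoted above can be tabulated as proportions of the whole sample. This is a minimal sketch using only the counts given in the text; the variable names are illustrative. Note that the eight counts sum to 5061, the stated size of the sample.

```python
# Counts per exon category (1-8) as quoted in the text from Zhang (1998).
exon_counts = {1: 271, 2: 38, 3: 482, 4: 553, 5: 174, 6: 69, 7: 34, 8: 3440}

total = sum(exon_counts.values())
fractions = {category: count / total for category, count in exon_counts.items()}

print(f"total exons: {total}")
for category, fraction in fractions.items():
    print(f"category ({category}): {exon_counts[category]:>4} exons, {100 * fraction:.1f}%")
```

Internal translated exons (category 8) account for roughly two-thirds of the sample, which is the expected pattern if a typical gene has one exon at each end and several in between.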

Zhang (1998) found that the size of 5’ UTR varied between human genes from 
0 bp to 2077 bp with a mean of 136 bp. The size of the 3’ UTR varies from -2 bp
to 3427 bp with a mean of 589 bp (Zhang, 1998). The -2 value, for the human
α-D-galactosidase A (GLA; Xq) gene, is due to the fact that the coding sequence
including the stop codon ends at nucleotide 11268 whilst the poly(A) addition site 
is located at 11266. 
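
The negative 3’ UTR length follows directly from the two coordinates given. As a minimal sketch, assuming the 3’ UTR length is defined as the poly(A) addition site minus the last nucleotide of the coding sequence (stop codon included), and with a function name chosen here for illustration:

```python
def utr3_length(cds_end, polya_site):
    """3' UTR length in bp: poly(A) addition site minus the last CDS nucleotide.

    A negative value means the poly(A) site lies inside the coding sequence.
    """
    return polya_site - cds_end


# Coordinates for the human GLA gene as quoted in the text.
print(utr3_length(cds_end=11268, polya_site=11266))
```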


3.4 Introns early or introns late? 


Were introns used in the assembly of the first genes or were they added only later to
previously contiguous coding sequences? The ‘introns early’ theory or simply the
‘exon theory of genes’ proposes that the genes encoding complex extant proteins
emerged through the coalescence of primordial minigenes (Gilbert, 1987, 1997).
These minigenes are held to have originally encoded protein modules and are now
represented as exons whereas the non-coding linker DNA between the minigenes
has survived as introns. Introns were then lost and novel exons made by fusing
smaller exons together. By contrast, the ‘introns late’ theory postulates that fully
functional genes had introns inserted into them at different stages in their evolution
(Cavalier-Smith, 1991; Palmer and Logsdon, 1991). At its extremes, the debate
between these two viewpoints is reminiscent of the old controversies between
selectionists and neutralists. However, perhaps not altogether surprisingly, the debate
has generated a significant amount of evidence supporting both theories (Gilbert et
al., 1997; Rogers, 1990).

Figure 3.2. Exon classification. All exons can be classified into these 12 mutually
exclusive classes. At the top, a schematic gene model is depicted which indicates how
some types of exons may be organized in a gene (redrawn from Zhang, 1998). For
explanation, see text. UTR: Untranslated region. CDS: Coding sequence.

One prediction of the introns-early hypothesis is that intron location should 
tend to demarcate structural or functional domains within proteins (Blake, 1978; 
Branden et al., 1984; Craik et al., 1983; Go, 1987). This question has proved highly 
contentious (Traut, 1988). The extreme introns-late view is that no such correla- 
tion exists and that introns have been inserted quasi-randomly into the structure 
of genes. Consistent with this standpoint, several genes encoding evolutionarily 
ancient proteins (actins, alcohol dehydrogenase, carbonic anhydrase II, pyruvate 
kinase, globins) appear not to exhibit any correlation between intron position and 
structural features (Stoltzfus et al., 1994; Weber and Kabsch, 1994). However, 
more recently, a computer program designed to predict precisely the locations of 
module boundaries in proteins has helped to demonstrate that there is indeed a 
strong correlation between intron position and structural elements at least in the 
sample of 32 ancient proteins examined (de Souza et al., 1996). The introns 
involved were invariably phase zero (de Souza et al., 1998; see Section 3.6.2). 

As we have noted in Section 3.2, the conservation of intron location is a common 
finding. That intron location is often conserved in evolutionarily ancient genes such 
as glyceraldehyde-3-phosphate dehydrogenase (GAPD; 12p13; Kersanach et al., 
1993), carbamoylphosphate synthetase (CPS1; 2q35; van den Hoff et al., 1995) and 
triose phosphate isomerase (TPI1; 12p13; Gilbert and Glynias 1993; Marchionni
and Gilbert 1986; McKnight et al., 1986; Straus and Gilbert 1985) is certainly com- 
patible with the introns-early hypothesis. Most informative in terms of the introns- 
early/late debate, however, are discordant cases, examples of evolutionarily related 
genes from different taxa in which either intron position varies by between several
nucleotides and several codons or where the intron is either present or absent (Rogers,
1989). The introns-early theorists have explained these discrepancies by invoking 
intron sliding and deletion (Craik et al., 1983) but despite the occasional convincing 
example [e.g. intron 8 of the histidyl-tRNA synthetase (HARS; chromosome 5) 
gene; Brenner and Corrochano, 1996; Figure 3.3; intron 2 of the glucose-dependent 
insulinotropic peptide (GIP; 17q21.3-q22) gene which results in an 8 amino acid 
deletion of the prepropeptide of the rat protein as compared to human; Higashimoto 
and Liddle, 1993], the evidence for the widespread occurrence of intron sliding is 
still rather weak (Yuasa et al., 1997; Stoltzfus et al., 1997). The introns-late view is that 
intron sliding is inherently improbable because such a process would almost 
inevitably involve an intermediate stage that would alter the reading frame thereby 
leading to the loss or inactivation of the protein product. Therefore introns-late 
devotees have regarded discordant intron location, as found in the evolutionarily 
ancient α- and β-tubulin (TUBA1, 2q; TUBA2, 13q11; TUBB, 6p21-pter; Dibb and 
Newman, 1989), aldehyde dehydrogenase (ALDH1, 9p21; ALDH2, 12q24; ALDH3, 
17; ALDH5, 9; ALDH6, 15q26; ALDH9, 1; ALDH10, 17p11.2; Rzhetsky et al., 
1997) and triose-phosphate isomerase (TPI; 12p13; Kwiatowski et al., 1995; 
Logsdon et al., 1995) genes as evidence for the occurrence of multiple independent 
insertional or deletional events. 
Several possible relationships may exist between the structural domains and 
the arrangement of exons in the gene (Figure 3.4). In many genes encoding globu- 
lar proteins (e.g. the globins), the exons correspond closely to the structural 
domains thereby ensuring that when exon duplication has occurred, it has given 
rise to domain duplication (Li, 1997). In the majority of cases, however, a more 
complex relationship exists between domains and exons (Li, 1997). Thus it may 
be that the correspondence between exons and domains is only approximate. In 
some cases, an exon may encode two or more domains whilst in others, a single 
domain may be encoded by two or more exons. Finally, there may be no obvious 
correspondence between exons and domains. 

Taken together, the results of all the studies so far performed suggest that the introns- 
early and introns-late viewpoints can to some extent be reconciled once we 
accept that they are not necessarily mutually exclusive, even for a single specific 
gene. A relationship between exon distribution and the domain structure of the 
encoded protein may therefore exist for some genes or some introns/domains 
within the same gene/protein. For others, it may be that such a relationship did 
once exist but has decayed with the passage of evolutionary time (Traut, 1988). 
Similarly with intron location, some introns were already present in primordial 
genes, whereas other genes lost or more likely gained (Cho and Doolittle, 1997) 
their introns at later stages of evolution. De Souza et al. (1998) have suggested 


(a) DNA sequence around intron 8 of the histidyl-tRNA synthetase (HARS) gene 
HARS-Fugu TATGTTGGTATGCAAGgtgagattt----tctgtagGTGGAATGGATTTGGCTGAACGT 
HARS-Hamster TATGTCCAGCAGCACGGTGAGgtaaa----gctccccagGTGTGTCTGGTAGAGCAG 
HARS-Human TATGTCCAGCAACATGGTGGG GTATCCCTGGTGGAACAG 

(b) Model for the shift of intron 8 


Taking the Fugu intron as the ancestral one, an A to G change (arrowed) creates a cryptic splice site 
TATGTTGGTATGCAAGgtgaggttt----tctgtagGTGGAATGGATTTGGCTGAACGT 


An additional insertion of a G (arrowed) then creates a frameshift (top) which would normally be lethal 
but is rescued by the cryptic splice site becoming functional (bottom) 
The result is a 5 nucleotide shift in the position of the intron but only a G to E substitution in the 

Figure 3.3. Intron sliding in the histidyl-tRNA synthetase (HARS) gene (after Brenner 
and Corrochano, 1996). (a) Comparison of the sequence around intron 8 in pufferfish 
(Fugu), human and hamster HARS genes. Lower case letters denote intron sequence, 
upper case letters exon sequence. The segment of the Fugu intron identical to hamster 
exon sequence is given in bold type. (b) Model for the sliding of intron 8. Nucleotide 
changes are in bold type and marked by arrows. 
that 30-40% of the present day intron locations correspond to phase zero 
introns originally present in the progenote. The remainder, they suggest, corre- 
spond to introns either added or moved and appear equally in all three phases. 
In the next section, the evidence for both the insertion and the deletion of introns 
will be examined. 


3.5 Mechanisms of intron insertion and deletion 


As we have seen, the introns-late theory both assumes and requires that introns 
have been inserted and deleted during evolution. One example of the gain of an 
exon during evolution comes from the ten exon human renin (REN; 1q32) gene 
which contains three amino acids encoded by a sixth 9 bp exon not present in the 
mouse gene (Miyazaki et al., 1984). This exon must have been acquired in the lin- 
eage leading to humans rather than lost in the murine lineage because this exon is 
absent from the evolutionarily related pepsin genes in both human (PGA3, 
PGA4, PGA5; 11q13) and pig. We may surmise that the sixth exon in the human 
REN gene was acquired by mutational change in intron 5 at some stage in the last 
100 Myrs since the divergence of the rodent and human lineages. Another possi- 
ble example of a late insertional event is provided by the human forkhead 
FKHL13 (FKHL13; 17q22-q25) gene which contains a single intron that has 
interrupted the forkhead DNA-binding domain (Murphy et al., 1997). 

The evolution of the human HLA-DRB6 (6p21.3) gene has been characterized not 
only by the deletion of an exon and the original promoter region but also by the de 
novo creation of an exon. This is thought to have occurred in association with the 
insertion of a retroviral LTR into intron 1 of the gene >23 Myrs ago, prior to the 
divergence of Old World monkeys from the human-ape lineage (Mayer et al., 1993). 
Whether the exon/promoter deletion accompanied or followed the LTR insertion is 
unclear. Either way, the gene is still capable of being transcribed. This is 
due to the creation, by the insertion, of an open reading frame for a new exon which 
serendipitously encoded a hydrophobic sequence that was able to function as a leader 
for the truncated HLA-DRB6 protein. The new exon also provided a functional 
donor splice site at its 3’ end which potentiated in-register splicing with exon 2 of the 
HLA-DRB6 gene. Finally, since the LTR also provided a substitute promoter 
region, the downstream HLA-DRB6 gene could be transcribed. 

Retrotransposition of an mRNA intermediate is one mechanism that would serve 
to erase completely the original exon-intron distribution of a gene. One potential 
example of this phenomenon is provided by the 68 kDa neurofilament protein 
(NEFL; 8p21) gene, a member of the intermediate filament multigene family that 
diverged over 600 Myrs ago. Other members of this family include desmin (DES; 
2q35), vimentin (VIM; 10p13), glial fibrillary acidic protein (GFAP; 17q21), and 
the type I and II keratins. Whereas these latter genes possess seven or eight introns, 
six of which occur at homologous locations, the NEFL gene possesses only three, 
none of which correspond in terms of their location to introns in the other known 
intermediate filament genes (Lewis and Cowan 1986). Retrotransposition of a 
cDNA intermediate could account for the present day structure of the NEFL gene 
but three new introns would still have to have been acquired, presumably by inser- 
tion. In other cases of putative intron loss, the retrotransposition of semiprocessed 
mRNAs (Chapter 6, section 6.1.3) should also be considered. Such retrotransposed 
sequences will, by definition, contain fewer introns than their parent genes but the 
retention of some introns will often serve to obscure their origin. 

The best characterized example of the insertion of an intron into a gene is from 
the sex-determining gene which in humans (SRY; Yp11.3) and other placental 
mammals is intronless. However, in dasyurid marsupials, an intron was inserted 
de novo into the Sry gene about 45 Myrs ago (O’Neill et al., 1998). The 825 bp 
intron lies within the coding sequence 550 bp from the start codon and was 
inserted in phase 1 (between the first and second bases of the codon). The intron, 
which contains a repetitive sequence element specific to marsupials, is correctly 
spliced out of the primary transcript. Since the SRY gene is essential for mam- 
malian sex determination, it is very unlikely that intron insertion could have 
inactivated the gene even temporarily. We must therefore conclude that the 
inserted sequence probably contained functional splice junction motifs that 
ensured its accurate removal from the marsupial Sry transcript. 
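
The phase assignment quoted above follows from simple arithmetic on the position of the intron. The following is a minimal sketch only: it assumes that '550 bp from the start codon' means that 550 coding nucleotides, counted from the first base of the start codon, lie 5' to the insertion site. The function name `intron_phase` is hypothetical and is used purely for illustration.

```python
def intron_phase(coding_nt_before_intron: int) -> int:
    """Return the phase (0, 1 or 2) of an intron.

    The argument is the number of coding nucleotides lying 5' to the
    intron, counted from the first base of the start codon (an assumption
    about how the position is measured).
    """
    if coding_nt_before_intron < 0:
        raise ValueError("position must be non-negative")
    # Remainder 0: intron lies between two codons (phase 0).
    # Remainder 1: intron lies after the first base of a codon (phase 1).
    # Remainder 2: intron lies after the second base of a codon (phase 2).
    return coding_nt_before_intron % 3


# Dasyurid Sry intron: 550 coding nucleotides assumed to precede the intron.
print(intron_phase(550))  # 1
```

Under this reading, 550 = 3 x 183 + 1, so the intron interrupts a codon after its first base, in agreement with the phase 1 assignment given in the text.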

Some gene families have evolved by multiple rounds of gene duplication accom- 
panied by the gain or loss of both introns and exons, for example the human 
lipoprotein lipase (LPL; 8p22), hepatic lipase (LIPC; 15q21-q23) and pancreatic 
lipase (PNLIP; 10q24-q26) genes and the homologous yolk protein 1 gene from 
Drosophila (Figure 3.5; Kirchgessner et al., 1989). Another example is provided by 
the human thyroid peroxidase (TPO; 2pter-p24) and myeloperoxidase (MPO; 
17q21.3-q23) genes (Kimura et al., 1989). 

The mammalian hepatocyte nuclear factor 1α (HNF1α; TCF1; 12q24.3) gene 
possesses nine exons, whereas its avian and amphibian counterparts have ten 
(Horlein et al., 1993). This difference is explicable in terms of the loss in mammals 
of intron 9 which in the chicken gene sub-divides the serine-rich transactivation 
domain. Interestingly, there is a sequence at the 3’ end of intron 9 in the chicken 
gene which matches a conserved element used as a joining signal in immunoglob- 
ulin and T-cell receptor genes. Similarly, a perfectly conserved heptamer 
(CACAGTG) box, a recombination signal sequence in immunoglobulin genes, is 
found at the junction of the fused exon 9-exon 10 in the rat HNF1α gene. Horlein 
et al. (1993) speculated that V(D)J recombinase may have been involved in the 
excision of intron 9 in the mammalian gene. 

The human surfeit 5 (SURF5; 9q34.1) gene contains an intron in its 5’ 
untranslated region which is not present in the mouse or rat Surf5 genes (Duhig 
et al., 1998). This additional intron is also present in apes, Old and New World 
monkeys and the prosimian Galago. Duhig et al. (1998) speculated that the 
intron was introduced after the divergence of primates and rodents but before 
the divergence of the human and prosimian lineages. 

A considerable number of examples are therefore now known of introns 
which have been either gained or lost during evolution as a result of the action 
of several distinct mechanisms. Such examples are certainly supportive of the 
introns-late theory. 


3.6 Exon shuffling 


3.6.1 Exon shuffling in the evolution of human genes 


Exon shuffling may be defined as the transfer of exons, encoding specific functional 
modules, between genes so that the module-associated functions are conferred upon 
the recipient proteins. One of the first examples of exon shuffling to be recognized 
was from the human low density lipoprotein receptor (LDLR; 19p13.3) gene which 
contains eight contiguous exons encoding epidermal growth factor (EGF)-like 
domains as well as seven repeats corresponding to LDL-binding domains (Südhof 
et al., 1985a,b). Other examples include the human hemopexin (HPX; 11p15.4- 
p15.5) gene which contains 10 exons each of which encodes a 45 amino acid repeat 
(Altruda et al., 1988) and the human factor XIII b subunit (F13B; 1q31-q32; 
Bottenus et al., 1990) gene which encodes a protein with ten 60 amino acid comple- 
ment B-type (‘sushi’) domains each encoded by a single exon. A large number of 
other human genes possess multiple copies of exons encoding specific domains. 
These encode proteins of the extracellular matrix (e.g. laminins, collagens), the ser- 
ine proteases of coagulation (see Section 3.6.3), a variety of receptors (e.g. integrins) 
and the immunoglobulins among many others (reviewed by Patthy, 1996). Exon 
shuffling is therefore a widely used evolutionary strategy although it appears to be 
confined to the genes of higher eukaryotes. 

Until recently, exon shuffling was considered to result solely from intron-medi- 
ated recombination (Patthy, 1995; Strelets and Lim, 1995). However, an alterna- 
tive mechanism, LINE element-mediated recombination, now appears to be capable 
of linking previously unlinked genomic DNA segments and may represent a 
novel means to bring about exon shuffling. Moran et al. (1999) have shown that 
LINE elements are capable of transducing exons from a downstream gene to new 
genomic locations thereby creating novel genetic combinations. This mechanism 
requires readthrough of the relatively weak LINE element polyadenylation signal 
by RNA polymerase to yield a processed mRNA transcript containing the LINE 
element fused to one or more exons from the neighboring gene. Readthrough is 
made possible by the presence of more potent polyadenylation signals 3’ to the 
LINE element and Moran et al. (1999) have shown that this can occur with high 
efficiency. The chimeric transcript then serves as a template for reverse transcrip- 
tase and if the resulting cDNA is subsequently integrated into the intron of a 
recipient gene, it may be recruited by that gene. This may help to explain why 3’ 
terminal exons are often much longer than internal exons (Hawkins, 1988). 
Inefficient mRNA cleavage/polyadenylation might however lead to alternative 
splicing (Section 3.2). A possible example of this process occurred about 10 Myrs 
ago during the evolution of the great apes: the co-transduction and integration of 
a LINE element and a fragment containing exon 9 of the cystic fibrosis trans- 
membrane conductance regulator (CFTR; 7q31) gene (Rozmahel et al., 1997). 
Since some 6% of LINE element insertions occur within genes (Moran et al., 
1999), it could be that this mechanism has played an important role in shuffling 
exons between genes, and may therefore have been central to the creation of the 
mosaic structures characteristic of so many genes in higher animals. 


3.6.2 The phase compatibility of introns 


Only a limited proportion of exons can be utilized for exon shuffling since the 
splice junctions of the shuffled exon must be phase compatible with their flank- 
ing exons in order to maintain the reading frame. As we have seen in Chapter 1, 
section 1.2.1, introns are classified as phase 0 if the intron lies between two 
codons, phase 1 if the intron lies between the first and second nucleotides of a 
codon, and phase 2 if it lies between the second and third nucleotides of a codon. 
The requirement for phase compatibility dictates that only symmetrical exons of 
class 1-1, class 2-2 and class 0-0 [i.e. exons flanked by introns of the same phase 
(1, 2, or 0)] are suitable for duplication or insertion. It is therefore not surprising 
that genes encoding mosaic proteins (proteins which comprise multiple domains 
and which are likely to have evolved by exon shuffling) contain a disproportion- 
ate number of class 1-1 exons encoding a wide variety of different modules, for 
example EGF-like, calcium-binding, fibronectin-like finger, kringle, complement 
B, LDL receptor, von Willebrand, thyroglobulin, C-type lectin etc (Patthy 1991a, 
1994, 1996). Class 1-1 exons predominate in the immunoglobulin, T-cell recep- 
tor and HLA genes, the Thy-1 glycoprotein (THY1; 11q22.3-q23) gene and the 
β2-microglobulin (B2M; 15q21-q22) gene, consistent with exon shuffling having 
been important in their evolution (Patthy et al., 1987). Further examples of this 
phenomenon include the exons encoding the complement B-type domains in the 
human C4b-binding protein α gene (C4BPA; 1q32; Hillarp et al., 1993), the 
immunoglobulin-like and fibronectin type III modules in the human Axl onco- 
gene (AXL; 19q13.1-13.2; Schulz et al., 1993) and four of the six throm- 
bospondin modules of the human properdin gene (BF; 6p21.3; Nolan et al., 1992). 
Genes with predominantly class 0-0 exons include those encoding the type III 
collagens (see Chapter 4) and β-casein (CSN2; 4pter-q21) whilst class 2-2 exons 
are exclusively found in the glucagon (GCG; 2q36-q37) gene (Patthy, 1987). 
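
The phase compatibility rule described above can be expressed as a short check. This is a sketch under stated assumptions rather than a published algorithm: an exon is represented simply by the phases of its 5' and 3' flanking introns, and the function names (`exon_class`, `is_symmetrical`, `insertion_keeps_frame`) are hypothetical. An exon shuffled into an intron is taken to be acceptable only if both of its flanking phases match the phase of the recipient intron, so that the exon is read in its own frame and the downstream exon remains in frame.

```python
from itertools import product

PHASES = (0, 1, 2)


def exon_class(five_prime_phase: int, three_prime_phase: int) -> str:
    """Name an exon by the phases of its flanking introns, e.g. '1-1'."""
    return f"{five_prime_phase}-{three_prime_phase}"


def is_symmetrical(five_prime_phase: int, three_prime_phase: int) -> bool:
    """A symmetrical exon is flanked by introns of the same phase."""
    return five_prime_phase == three_prime_phase


def insertion_keeps_frame(five_prime_phase: int,
                          three_prime_phase: int,
                          recipient_intron_phase: int) -> bool:
    """True if the exon can be inserted into an intron of the given phase
    without disturbing the reading frame (simplified model)."""
    return (five_prime_phase == recipient_intron_phase
            and three_prime_phase == recipient_intron_phase)


# Of the nine possible exon classes, only three are symmetrical.
symmetrical = [exon_class(f, t)
               for f, t in product(PHASES, repeat=2)
               if is_symmetrical(f, t)]
print(symmetrical)                       # ['0-0', '1-1', '2-2']

print(insertion_keeps_frame(1, 1, 1))    # True: class 1-1 exon, phase 1 intron
print(insertion_keeps_frame(1, 1, 0))    # False: phase mismatch with recipient
print(insertion_keeps_frame(1, 2, 1))    # False: nonsymmetrical exon
```

In this simplified model a class 1-1 exon can be duplicated in tandem or moved into any phase 1 intron, which is why a family of modules sharing the same flanking phase can be recombined freely with one another.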

Using a large database of eukaryotic genes, Long et al. (1995) demonstrated there 
to be an excess of symmetrical exons over expectation and estimated that at least 
19% of exons had been involved in exon shuffling. Why the preponderance of class 
1-1 exons? Patthy (1994) suggested that since 92% of introns flanking signal pep- 
tide domains were phase 1, modularization of exported proteins with secretory sig- 
nal peptide domains is likely to have employed class 1-1 introns. According to 
Patthy (1994), modularization proceeds in different stages: (i) insertion of introns 
of identical phase at boundaries of the protein fold, (ii) tandem duplication of the 
symmetrical module by intronic recombination, and (iii) module transfer to a 
novel location (Figure 3.6). Nonsymmetrical exons (those flanked by introns of dif- 
ferent phase) are potentially much less versatile and may be expected to have been 
utilized much less often. Tomita et al. (1996) noted that when introns are inserted 
so as to disrupt codons, the site of insertion occurs much more frequently between 
the first and second bases than between the second and third bases. The reason for 
this remains unclear. 
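
The notion of an 'excess of symmetrical exons over expectation' can be illustrated with a simple null model. The sketch below assumes that the phases of the two introns flanking an exon are independent draws from the overall distribution of intron phases, so that the expected fraction of class i-i exons is the square of the frequency of phase i. This null model is an assumption made here for illustration and is not necessarily the exact procedure used by Long et al. (1995); the phase frequencies are hypothetical values, not data from that study.

```python
def expected_symmetrical_fraction(phase_freqs: dict) -> float:
    """Expected fraction of symmetrical exons if the two flanking intron
    phases are independent (assumed null model).

    phase_freqs maps each phase (0, 1, 2) to its frequency; the
    frequencies must sum to 1.
    """
    total = sum(phase_freqs.values())
    if abs(total - 1.0) > 1e-9:
        raise ValueError("phase frequencies must sum to 1")
    # P(class 0-0) + P(class 1-1) + P(class 2-2)
    return sum(p * p for p in phase_freqs.values())


# Hypothetical phase frequencies, chosen only to show the arithmetic.
hypothetical_freqs = {0: 0.5, 1: 0.3, 2: 0.2}
expected = expected_symmetrical_fraction(hypothetical_freqs)
print(round(expected, 2))  # 0.38
```

An observed fraction of symmetrical exons appreciably greater than the value computed in this way would constitute an excess over expectation in the sense used above.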

There are however various instances which do not conform to the above exon 
shuffling rules. In some cases, the original exon-intron organization of genes 
has become eroded with the passage of evolutionary time. Thus, although in the 
human perlecan (HSPG2; 1p35-p36; Cohen et al., 1993) gene, the LDL receptor 
modules and the immunoglobulin-like modules are still flanked by phase 1 
introns, the original introns have been lost from the regions of the gene encod- 
ing laminin A, laminin B and epidermal growth factor-like modules. Similarly, 
in the human complement 6 (C6; chromosome 5; Hobart et al., 1993) gene, 
phase 1 introns are found only at the boundaries of one of the complement B 
modules having been lost from the boundaries of the class 1-1 thrombospondin, 
Figure 3.6. Different stages in the conversion of a domain to a module (redrawn from 
Patthy 1994). The boxes represent exons separated by introns. The exon encoding the 
signal peptide is solid whilst the exons encoding the protein fold (A) are lightly shaded. 
Stage 1: Insertion of introns of identical phase at amino and carboxy terminal boundaries 
of protein fold. Stage 2: Tandem duplications of symmetrical protomodule A via intronic 
recombination. Stage 3: Module A is transferred to a new location. 


LDL receptor, EGF, and C7 modules through multiple occurrences of intron 
insertion and removal. Finally, the exon shuffling rules do not apply in some 
cases of genes encoding proteins common to both prokaryotes and eukaryotes 
(phosphoglycerate kinase, alcohol dehydrogenase, pyruvate kinase, glyceralde- 
hyde-3-phosphate dehydrogenase, triosephosphate isomerase and dihydrofolate 
reductase). This is consistent with the view that these evolutionarily very 
ancient genes did not evolve by exon shuffling (Patthy, 1987, 1991a). However, 
Long et al. (1995) claimed that there is an excess of symmetrical exons in the 
ancient conserved regions of eukaryotic genes (regions homologous to prokary- 
otic genes), a finding that is consistent with at least some introns being of 
ancient origin (Gilbert et al., 1997). 


3.6.3 The serine proteases of coagulation 


One of the archetypal examples of the evolution of modular proteins by exon shuf- 
fling is that of the serine proteases of coagulation. As the completed primary 
sequences of hemostatic factors became available, it was noticed that certain 
domains of shared homology recurred many times in diverse proteins. Figure 3.7 
depicts 11 proteins of coagulation and fibrinolysis in such a way as to emphasise 
their modular composition. The kringle domain is present in prothrombin as well 
as three proteins of fibrinolysis (tPA, urokinase and plasminogen) and may play a 
role in fibrin binding. A trypsin-like serine protease domain is common to all the 
proteolytic enzymes of coagulation. The EGF-like domain was first recognized 
among clotting proteins in factor X. Two copies of this motif are found in all the 
vitamin K-dependent proteins of coagulation (factor IX, factor X, factor VII, pro- 
tein C) except protein S which has four and prothrombin which has none. The 
action of vitamin K in coaguloprotein synthesis is to promote carboxylation of N- 
terminal glutamic acid residues to γ-carboxyglutamic acid (Gla) by a liver microsomal enzyme system. The 
five proteins which share this N-terminal modification have a homologous 
propeptide domain that functions as a signal to the carboxylase. 

The organization of coagulation factor genes reflects to a considerable extent 
their functional modular assembly as described above. Figure 3.8 shows the gene 
structure of several clotting factors. It is apparent that since factors VII (F7; 
13q34), X (F10; 13q34), IX (F9; Xq26.3-q27.1) and protein C (PROC; 2q13-q21) 
have virtually identical protein and gene structures, they must have emerged rel- 
atively recently by a process of gene duplication and divergence. Similarly, tissue 
plasminogen activator (PLAT; 8p12-q11.2), factor XI (F11; 4q35), factor XII 
Figure 3.7. Domains of hemostatic proteins (redrawn from Tuddenham and Cooper, 
1994). 
(F12; 5q33-qter) and urokinase (PLAU; 10q24-qter) exhibit similar gene organi- 
zation with respect to their protease domains but differ from prothrombin (F2; 
11p11-q12) and factors VII, IX, X, and protein C. Kringle domains are encoded 
by a similar gene structure wherever they occur, as are the calcium-binding Gla 
domains. EGF-like domains are encoded by a single exon with type 1 splice junc- 
tions as are the fibronectin type I (Fn I) domains. 

The presence/absence of different modules (Gla and EGF-like domains, 
kringles) in the different coagulation factor proteins can be used to reconstruct 
their evolutionary past. The transfer of modules between proteins has arisen via 
exon shuffling and protein data can be used to construct evolutionary trees 
(Patthy, 1985) that are consistent with what is already known of the genes in terms 
of their exon size and distribution. Patthy (1990) has described at length the prin- 
ciples underlying the assembly of present-day coagulation factor genes from their 
constituent modules. Evolution of these genes has proceeded by repeated inser- 
tions, duplications, exchanges and deletions of modules. How has it been possible 
to produce such a plethora of different proteins/genes by exon shuffling in what 
has been a comparatively short period of evolutionary time? The answer appears 
to lie firstly with the close correspondence between exon boundaries and the mod- 
ular domains of the proteins and secondly with the fact that the ancestors of the 
kringle, fibronectin, growth factor, protease and Gla modules all had phase 1 
introns at both module boundaries. 

The phylogenetic tree of the serine proteases, constructed by Patthy (1990) is 
shown in Figure 3.9. In this diagram, the major division between the blood coagu- 
lation proteases and those involved in fibrinolysis is very apparent. This division 
Figure 3.8. Organization of the genes of hemostatic proteins (redrawn from Tuddenham 
and Cooper, 1994). Fn: Fibronectin-like domain. Intron phase is denoted by 0, I or II. 
is also apparent when intron/exon distribution and codon usage for the active site 
serine (Brenner, 1988) are considered: the alternative codons, TCN and AGY, can- 
not be interconverted by a single nucleotide substitution. The genes encoding fac- 
tors VII, IX, X, and protein C possess the AGY codon whilst the fibrinolytic 
enzyme genes possess the TCN codon. Phylogenetic trees for the protease and 
kringle domains are however not identical (Ikeo et al., 1995) suggesting that these 
domains may have experienced different evolutionary histories. This implies that 
in multidomain proteins such as the serine proteases, each domain can represent 
an independent evolutionary unit. 
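
The statement that the two classes of serine codon cannot be interconverted by a single nucleotide substitution can be verified by enumeration. The following sketch expands the degenerate codons TCN (N = any base) and AGY (Y = C or T) and computes the smallest number of nucleotide differences between any pair; the variable and function names are illustrative only.

```python
from itertools import product

BASES = "ACGT"        # N: any nucleotide
PYRIMIDINES = "CT"    # Y: pyrimidine

# Expand the two degenerate serine codon families.
tcn_codons = ["TC" + n for n in BASES]          # TCA, TCC, TCG, TCT
agy_codons = ["AG" + y for y in PYRIMIDINES]    # AGC, AGT


def differences(codon_a: str, codon_b: str) -> int:
    """Number of positions at which two codons differ."""
    return sum(a != b for a, b in zip(codon_a, codon_b))


# Smallest number of substitutions needed to pass from one family to the other.
min_substitutions = min(differences(a, b)
                        for a, b in product(tcn_codons, agy_codons))
print(min_substitutions)  # 2
```

Since at least two substitutions are required, any stepwise change from TCN to AGY must pass through a codon that does not encode serine (for example ACC, threonine, or TGC, cysteine, on the path from TCC to AGC). This is why the codon used for the active site serine serves as a useful marker of lineage.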


3.6.4 Protein folds, primordial exons and the emergence of exon shuffling 


Proteins are composed of combinations of secondary structural elements, α-helices 
and β-sheets, which are connected by loop regions at the surface of the molecule. 
Certain combinations of these elements (termed motifs or folds) are frequently used 
Figure 3.9. Evolution of the serine protease family (redrawn from Patthy, 1990). The 
genealogy of the serine protease family was elucidated by reference to the known amino 
acid sequences. Dashed lines indicate proteases whose gene structure has not yet been 
established. 0, 1, 2 denote the phase class of the introns. Introns are labelled A to S, and 
putative intron addition and removal events during evolution (consistent with the 
genealogy based upon amino acid sequence data) are denoted by (+) and (-) respectively. 
Protein abbreviations used are: HP, haptoglobin; C1r and C1s, complement factors C1r and 
C1s; PT, prothrombin; PC, protein C; IX, X, VII, factors IX, X, VII; uPA, urokinase; 
tPA, tissue-type plasminogen activator; XII, factor XII; PL, plasminogen and 
apolipoprotein(a); XI, factor XI and pre-kallikrein; CFI, complement factor I; TR, 
trypsin; KL, kallikreins; EL, elastases; CH, chymotrypsin; AD, ME, adipsin, medullasin, 
mast cell proteases, cytotoxic lymphocyte proteases, cathepsin G. 
in proteins and are known as domains. Evolutionarily mobile domains are termed 
modules. Motifs can be as simple as in the case of the hexamer repeat unit that forms 
a left-handed parallel β-helix found in UDP-N-acetylglucosamine acyltransferase 
or more complex as in the 21-26 amino acid zinc finger motif (Henikoff et al., 
1997). For practical purposes, a protein domain may be regarded as a structurally 
and/or functionally discrete portion of a protein that can fold independently into a 
stable tertiary structure and may be composed of single or multiple motifs. Since 
the zinc finger motif can fold independently, it may actually be regarded as a small 
domain in its own right. Most proteins contain two or more domains although 
some may just contain a single domain (Doolittle, 1995). 

During evolutionary time, the amino acid sequences of proteins are eroded at 
a much faster rate than are the corresponding 3D structures. In other words, 
although it may be difficult to discern sequence conservation, structural conser- 
vation may still be evident (Creighton, 1993). One example of this is provided by 
the α-defensin (DEFA1, DEFA4, DEFA5, DEFA6; 8p23) and β-defensin 
(DEFB1, DEFB2; 8p23) genes which, despite a complete lack of DNA sequence 
homology, and major differences between their encoded proteins in terms of 
their cysteine spacing and disulfide pairing, have nevertheless evolved from a 
common ancestor (Lei et al., 1997). 

It is often quite difficult to ascertain whether structural similarities are homolo- 
gous (i.e. based upon divergent evolution from a common ancestor) or analogous (i.e. 
based upon convergent evolution to a physically favoured secondary or tertiary 
structure). The α/β barrel domain is common among enzymes and examples of this 
domain are thought to share a common ancestor rather than to have evolved con- 
vergently to a stable fold (Reardon and Farber, 1995). Such examples are a reflection 
of the evolution of extant protein structures from a small number of basic folds 
encoded by primordial exons. 

Primordial exons encoding functional domains are now widely dispersed 
among many diverse proteins (Doolittle, 1995). Examples include the ankyrin 
repeat (Bork, 1993), the spectrin repeat (Pascual et al., 1997), the EGF-like domain 
(Campbell and Bork 1993), the WD-repeat (Neer et al., 1994), the ATPase domain 
(Bork et al., 1992) and kringles (Patthy et al., 1984). The EGF-like domain is pre- 
sent in ~1% of human proteins whilst the immunoglobulin domain may be even 
more prevalent (Henikoff et al., 1997). Other domains such as the Kunitz domain 
which is less prevalent, may have a more recent origin (Ikeo et al., 1992). 

How many primordial exons were required to construct the huge array of extant 
proteins? Dorit et al. (1990) used the frequency with which exons in a 1200 exon 
database had been ‘reused’ between genes encoding different proteins to assess the 
size of the underlying exon pool; they estimated that the number of primordial 
exons was between 1000 and 7000. A similar study succeeded in distilling 1410 
polypeptide chains down to 112 analogous fold families whose members exhibited 
an average of <18% sequence identity (Orengo et al., 1993). Patthy (1991b) has 
pointed out, however, that this type of approach may tend to underestimate the true 
number of exons since it may not adequately have considered the pool of exons that 
participate in exon shuffling only rarely. Databases of protein folds and their fami- 
lies are available online at http://www.embl-ebi.ac.uk/dali/ (Homology-derived 
Structures of Proteins), http://www.mips.biochem.mpg.de/ (Munich Information 
Center for Protein Sequences) and http://protein.toulouse.inra.fr/prodom.html 
(Protein Domain Database). 

Mosaic proteins assembled from class 1-1 modules are confined predominantly 
to multicellular organisms (Patthy 1991a). Exon shuffling probably only came 
into its own with the radiation of the metazoa when the requirements of multi- 
cellularity fuelled the demand for novel and increasingly more complex proteins 
to potentiate intercellular communication. Perhaps the cellular machinery for 
exon shuffling only appeared at this time. 


3.7 Pseudoexons 


Pseudoexons may be defined as gene regions that are functional in a gene from 
one species and which have been conserved in non-functional form in another 
species. Good examples of pseudoexons are those found in the human αA-crys- 
tallin (CRYAA; 21q22.3; Jaworski and Piatigorsky, 1989), glycophorin B (GYPB; 
4q28-q31; Huang et al., 1995), complement component receptor 2 (CR2; 1q32; 
Holguin et al., 1990) and transketolase-related (TKT; Xq28; Coy et al., 1996) 
genes. In the CRYAA example, the pseudoexon is alternatively spliced into the 
mRNA in rodents and bats but was lost to other mammalian species about 30-40 
Myrs ago. Exon III of the glycophorin B (GYPB; 4q28-q31) gene is expressed in 
the chimpanzee and gorilla but has been silenced as a pseudoexon in humans 
through an exon 3 donor splice site mutation (Huang et al., 1995; Kudo and 
Fukuda, 1989; Xie et al., 1997). This finding predicts a larger extracellular domain 
in the chimpanzee protein. Similarly, the human glycophorin E (GYPE; 4q28- 
q31) gene has lost two of its original exons (Kudo and Fukuda, 1990). The recom- 
binational use of these pseudoexons represents a major mechanism for allelic 
diversification in the glycophorins: depending upon the extent, location and type 
of recombination, the exchange or transfer of pseudoexons can result in the cre- 
ation of novel inter-exon or intra-exon hybrid junctions that yield new mRNA 
splicing patterns (Blumenfeld and Huang, 1995). 

Murine apolipoprotein E receptor 2 gene transcripts possess an extra exon as 
compared with transcripts from the human (LRP8; 1p34) gene (Kim et al., 1998). 
However, the human and marmoset LRP8 genes possess an intronic pseudoexon 
corresponding to the murine exon indicating that this exon has been lost in the 
primate lineage. The incorporation of the pseudoexon into the primate LRP8 
mRNA transcript is prevented by two nucleotide substitutions in the adjacent 5’ 
donor splice site. The pseudoexon contains a deletion which would have led to a 
frameshift but it is unclear whether the deletion occurred after inactivation of the 
exon or if the splice site substitutions served to allow the primate LRP8 gene to 
avoid incorporation of the mutation-containing exon. 

The human insulin-like growth factor-II (IGF2; 11p15.5) gene is transcribed and 
processed into three different mRNAs under the control of three distinct promot- 
ers, two of which have counterparts in the mouse and rat (de Pagter-Holthuisen et 
al., 1987; 1988). The 5'-most promoter, which is active in adult human tissues, con- 
trols the transcription of a cassette of three exons that comprises the 5’ UTR of the 
human gene. The murine gene does not, however, contain a structural and functional 
homologue of exon 1 (and adjacent promoter) of the human gene (Rotwein and 
Hall, 1990). The absence of this cassette from murine Igf2 mRNA may account for 
the disappearance of Igf2 transcripts from most murine tissues shortly after birth. 

The presence of pseudoexons in a number of human genes has thus provided 
good evidence for the occurrence of exon inactivation (and hence also intron loss) 
during mammalian evolution. Such events are certainly compatible with the 
introns early theory. It is however unclear how common pseudoexons are in 
human genes since they will usually only be detected through the careful com- 
parison of orthologous gene pairs, a procedure that will tend to underestimate 
their prevalence. 


4 
Genes and gene families 


4.1 The origins of human genes 


Certainly the growth of the forebrain has been a success: 

He has not got lost in a backwater like the lampshell 

Or the limpet; he has not died out like the super-lizards. 

His boneless worm-like ancestors would be amazed 

At the upright position, the breasts, the four-chambered heart, 

The clandestine evolution in the mother’s shadow. 

W.H. Auden. In Time of War, Commentary from Journey to a War (1939) 


Some human genes appear to have originated comparatively recently. Indeed a 
very few have emerged in the relatively short space of time that has elapsed since 
divergence from chimpanzee some 7 Myrs ago. Other human genes appear to have 
homologues in simple organisms implying a very ancient origin, in some cases 
even predating the divergence of prokaryotes and eukaryotes. In the following 
sections, representative examples of genes are cited in order to illustrate the enor- 
mous variability in the antiquity of extant human genes. To some extent of course, 
the origin of a given gene is a matter merely of semantic distinction. Thus 
Doolittle’s (1992) truism stated: ‘new proteins come from old proteins as a result 
of gene duplications followed by base substitutions.’ This notwithstanding, it has 
still been possible to draw up a working classificatory scheme for proteins based 
on their ‘apparent invention time’ (Table 4.1). 


4.1.1 Genes with a specifically human origin 


There are relatively few known examples of human genes which originated in the 
last 7 Myrs after the divergence of the human and chimpanzee lineages. One such 
gene is the potentially functional immunoglobulin Vκ gene located 1.5 Mb telomeric 
to the human Cκ gene which is putatively human-specific (Huber et al., 1994). 
Further studies of the immunoglobulin κ locus (comprising one Cκ, 5 Jκ, and 76 Vκ 
gene segments on the short arm of human chromosome 2) have provided evidence 
for the human-specific partial duplication of the locus (Ermert et al., 1995). The 
duplicated portion of the immunoglobulin κ locus, which contains 36 Vκ genes and 
pseudogenes, is not found in the chimpanzee or gorilla. Thus, the duplication event 
must have occurred after the branchpoint of humans from the great apes. Huber et 
al. (1994) estimated that this duplication event occurred 1 Myr ago. 

Table 4.1. Classification of human proteins by ‘invention period’ (after Doolittle, 1992) 


Ancient proteins 
  First editions: direct descent to both humans and contemporary prokaryotes. Mostly 
    mainstream metabolic enzymes (e.g. triosephosphate isomerase). 
  Second editions: homologous sequences in humans and prokaryotes but apparently serve 
    different functions (e.g. 27% identity between glutathione reductase (human red 
    blood cells) and mercury reductase (Pseudomonas)). 


Middle age proteins 
  Proteins found in most eukaryotes but prokaryotic counterparts are as yet unknown 
    (e.g. actin). 


Modern proteins 
  Recent vintage: proteins found in animals or plants but not both. Not found in 
    prokaryotes (e.g. collagen). 
  Very recent inventions: proteins confined to vertebrates (e.g. albumin). 
  Recent mosaics: modern proteins resulting from exon shuffling (e.g. low density 
    lipoprotein receptor). 
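
The scheme of Table 4.1 can be read as a simple decision rule on taxonomic distribution. The following is a minimal Python sketch of that reading, not Doolittle's own procedure: the function name, its boolean arguments and the returned labels are all hypothetical, and the distinction between first and second editions is left out because it depends on function rather than on distribution.

```python
def invention_period(in_prokaryotes: bool,
                     in_animals: bool,
                     in_plants: bool,
                     vertebrates_only: bool = False,
                     exon_shuffled: bool = False) -> str:
    """Assign a protein to an 'invention period' in the spirit of Table 4.1.

    All argument names and labels are illustrative assumptions.
    """
    if in_prokaryotes:
        # First vs. second editions differ in function, which presence/absence
        # data alone cannot reveal, so both are reported simply as 'ancient'.
        return "ancient"
    if in_animals and in_plants:
        return "middle age"
    if exon_shuffled:
        return "modern: recent mosaic"
    if vertebrates_only:
        return "modern: very recent invention"
    return "modern: recent vintage"


# The examples named in Table 4.1, encoded by hand.
examples = {
    "triosephosphate isomerase": invention_period(True, True, True),
    "actin": invention_period(False, True, True),
    "collagen": invention_period(False, True, False),
    "albumin": invention_period(False, True, False, vertebrates_only=True),
    "LDL receptor": invention_period(False, True, False, exon_shuffled=True),
}
```

Such a rule is only as good as the sampling of taxa behind it: a protein scored as absent from prokaryotes may simply not yet have been found there.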


Another example of human specificity is provided by the salivary amylase 
(AMY1) genes clustered on human chromosome 1p21: three such genes are to be 
found in the human genome whereas only one is present in chimpanzee 
(Samuelson et al., 1990). The most parsimonious explanation is that amplification 
of the AMY1 gene sequences has occurred during the last 7 Myrs, the time since 
human and chimpanzee last shared a common ancestor. Similarly, the high-affinity 
immunoglobulin receptor genes (FCGR1A, 1q21; FCGR1B, 1p12; FCGR1C, 
1q21) were triplicated from a single primate ancestral Fcgr1 gene about 3 Myrs 
ago, after the divergence of chimpanzees from the human lineage (Maresco et al., 
1996). Finally, evidence for substantial deletions/translocations in the chim- 
panzee genome as compared to human is apparent from pulsed field gel elec- 
trophoretic studies of the region between the HLA-B and TNF genes on 
chimpanzee chromosome 5p21.3 (Leelayuwat et al., 1993). 

The gene coding regions of humans and chimpanzees are typically of the order 
of 99% homologous (e.g. BRCA1; Hacia et al., 1998). As yet, however, there are no 
known differences between the human and chimpanzee genomes in terms of 
either the presence/absence of genes or gene copy number that could account for 
the considerable anatomical and behavioral differences between the two species. 
We may surmise that subtle differences in the expression of regulatory genes or 
alternatively differences in the responsiveness of those genes which serve as tar- 
gets for the action of regulatory proteins, may help to explain why humans are not 
chimpanzees. Candidate genes could include not only those involved in develop- 
mental regulation but possibly also those situated adjacent to human-specific 
karyotypic rearrangements. 
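
A figure such as "99%" for coding regions is a pairwise identity computed over aligned positions. As a minimal sketch of that calculation, assuming the two sequences have already been aligned to equal length and that gaps are written as "-", the hypothetical helper below counts matching columns; the toy sequences are invented for illustration and are not real BRCA1 data.

```python
def percent_identity(seq_a: str, seq_b: str, gap: str = "-") -> float:
    """Percentage of identical residues over the ungapped aligned columns."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    compared = 0
    matches = 0
    for a, b in zip(seq_a.upper(), seq_b.upper()):
        if a == gap or b == gap:
            continue  # skip columns where either sequence has a gap
        compared += 1
        if a == b:
            matches += 1
    if compared == 0:
        raise ValueError("no ungapped columns to compare")
    return 100.0 * matches / compared


# Toy example: one mismatch in ten aligned positions.
toy_identity = percent_identity("ATGGCCAAGT", "ATGGCTAAGT")
```

Whether gapped columns are excluded, as here, or counted as mismatches changes the result, so published identity values are comparable only when the convention is the same.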


4.1.2 Human genes which originated after the divergence of Old World 
monkeys and New World monkeys 


The haptoglobin gene appears to have become amplified after the separation of 
Old World monkeys from New World monkeys. The gorilla, chimpanzee, 
orangutan and rhesus monkey have three haptoglobin genes, humans have two 
(HP, HPR; 16q22), whilst the spider monkey and cebus monkey have only one 
(McEvoy and Maeda 1988). The apolipoprotein(a) (LPA; 6q27; Lawn et al., 
1997; Pesole et al., 1994) gene probably also emerged during this time since its 
presence has only been detected in humans, apes, and Old World monkeys. 

The number of rhesus blood group antigen genes varies between primate 
species, humans having two (RHD, RHCE; 1p34-p36.2), gorillas two, chim- 
panzees three, whilst orangutan, gibbons and all Old World and New World mon- 
keys possess only a single RH-like gene (Salvignol et al., 1995). Duplication of the 
RH-like genes probably therefore occurred between 8 and 11 Myrs ago before 
divergence of the human and gorilla lineages. 

Eosinophil-derived neurotoxin (EDN) and eosinophil cationic protein (ECP) are 
host defense proteins which are members of the ribonuclease family. They are 
encoded respectively by the RNASE2 and RNASE3 genes located on human chro- 
mosome 14q24-q31. Divergence of the nucleotide sequences of the two related genes 
was found to be consistent with a duplication event occurring some 25-40 Myrs ago 
after the divergence of Old World monkeys from New World monkeys (Hamann et 
al., 1990). In agreement with this prediction, RNASE2 and RNASE3 orthologues 
are evident in chimpanzee, gorilla, orangutan and macaque but not in the marmoset, 
a New World monkey (Rosenberg et al., 1995; Zhang et al., 1998). Since their dupli- 
cation, the two proteins have diverged very rapidly, with evolution promoting the 
acquisition of novel functions viz. increased cationicity/toxicity (ECP) and enhanced 
ribonuclease activity (Rosenberg and Dyer, 1995). 
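
The reasoning used here, and throughout Section 4.1, brackets the origin of a gene between two lineage splits: the gene must be at least as old as the deepest split separating humans from a species that carries it, and (if absence reflects true absence rather than secondary loss) younger than the shallowest split separating humans from a species that lacks it. A minimal Python sketch of this logic follows. The function name and the table of split times are assumptions made for illustration: only the 7 Myrs figure for chimpanzee is taken from the text, and the other values are placeholders, not estimates endorsed by the sources cited.

```python
from typing import Dict, Iterable, Optional, Tuple

# Time (Myrs ago) at which each lineage split from the human lineage.
# Only the chimpanzee value comes from the text; the rest are placeholders.
SPLIT_FROM_HUMAN_MYRS: Dict[str, float] = {
    "chimpanzee": 7,
    "gorilla": 8,
    "orangutan": 14,
    "macaque": 25,
    "marmoset": 40,
}


def origin_bracket(
    present_in: Iterable[str],
    absent_in: Iterable[str],
    split_times: Dict[str, float] = SPLIT_FROM_HUMAN_MYRS,
) -> Tuple[float, Optional[float]]:
    """Return (minimum age, maximum age) in Myrs for a human gene.

    The maximum age is None when no sampled species lacks the gene.
    """
    min_age = max((split_times[t] for t in present_in), default=0.0)
    max_age = min((split_times[t] for t in absent_in), default=None)
    if max_age is not None and max_age < min_age:
        # Presence in a distant species but absence in a closer one
        # points to secondary loss, which this simple rule cannot date.
        raise ValueError("distribution implies gene loss; bracket undefined")
    return min_age, max_age


# Distribution reported for the RNASE2/RNASE3 pair.
rnase_bracket = origin_bracket(
    present_in=["chimpanzee", "gorilla", "orangutan", "macaque"],
    absent_in=["marmoset"],
)
```

With these placeholder split times the bracket for the RNASE2/RNASE3 duplication is 25 to 40 Myrs, in line with the estimate from sequence divergence quoted above; different split times would of course move the bracket.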


4.1.3 Human genes which originated during primate evolution 


The ZNF91 Krüppel-associated box-containing zinc finger gene family on 
human chromosome 19p12-p13 is present in the great apes, the gibbons, Old 
World and New World monkeys but is not found in prosimians or rodents 
(Bellefroid et al., 1995). The origin of this gene family is unclear. Presumably, the 
ZNF91 gene cluster arose by duplication/amplification at least 55 Myrs ago in the 
common ancestor of simians. 

The three alcohol dehydrogenase genes (ADH1, ADH2, and ADH3), linked on 
human chromosome 4q22, also originated during the adaptive radiation of the 
primates; two successive duplications are estimated to have occurred 45±8 Myrs 
and 60±8 Myrs ago (Duester et al., 1986; Ikuta et al., 1986; Trezise et al., 1989; 
1991; Yokoyama et al., 1987). 

Other examples of human genes with a primate origin include the interferon-α 
gene (IFNA) and growth hormone (GH) gene clusters. The IFNA cluster, containing 
13 members (see Section 4.2.1) and located at chromosome 9p21, has 
emerged by a process of duplication and divergence over the last 26 Myrs (Miyata 
and Hayashida 1982). The GH gene cluster is also essentially a primate creation 
with several genes (GH1, GH2, CSH1, and CSH2; 17q23) having emerged over 
the last 25-50 Myrs by a process of duplication and divergence (Chen et al., 1989; 
Golos et al., 1993; Miller and Eberhardt 1983; see Section 4.2.1). 

Some genes emerged in the last 100 Myrs during the adaptive radiation of the 
mammals. For example, the fucosyltransferase gene family (FUT1, FUT2, 
19q13.1-qter; FUT3, FUT5, FUT6, 19p13.3; FUT4, 11q21; FUT7, 9; FUT8, 
14q23) is thought to have emerged during this period by a process of gene dupli- 
cation, translocation and divergence (Costache et al., 1997). 

α-Lactalbumin is a mammary gland-expressed calcium-binding protein which 
interacts with galactosyl transferase to promote lactose synthesis. Although this 
function is clearly confined to mammals, the α-lactalbumin gene (LALBA; 
12q13) is thought to have arisen by duplication of the gene encoding lysozyme c, 
a bacteriolytic enzyme present in the tissues and secretions of 
mammals, birds, reptiles and even insects. This gene duplication probably 
occurred 300-400 Myrs ago (Nitta and Sugai, 1989; Prager, 1996; Prager and 
Wilson, 1988; Qasba and Kumar, 1997) well before the evolutionary emergence of 
the mammals and the process of lactation. 


4.1.4 Human genes whose origin preceded the divergence of mammals 


Other genes were specifically vertebrate creations, their emergence through 
duplication and divergence having perhaps been made possible by the genome 
duplications thought to have occurred prior to the adaptive radiation of this sub- 
phylum (Chapter 2, section 2.1). A new wave of gene creation may have accompa- 
nied the emergence of the amphibians and early reptiles thereby equipping them 
for terrestrial life. 

One example of a vertebrate creation appears to be the vitamin K-dependent 
serine proteases of blood coagulation (factors VII, IX, X, protein C and pro- 
thrombin). Prothrombin is present in bony fish (trout), cartilaginous fish (dog- 
fish) and the hagfish, one of the modern representatives of the jawless Agnatha 
(Banfield and MacGillivray, 1992; Doolittle, 1993). There is no evidence, how- 
ever, for the existence of thrombin or a thrombin-like protein in either proto- 
chordates or echinoderms. Whether the other four vitamin K-dependent factors 
of coagulation are present in fish is as yet unclear (Doolittle, 1993). If they are, the 
adaptive radiation of the vitamin K-dependent factors of hemostasis must have 
occurred during the space of some 50 Myrs between the divergence of the proto- 
chordates and the appearance of the Agnatha, some 450 Myrs ago (Doolittle, 
1993). The subsequent evolution of the vitamin K-dependent factors of coagula- 
tion is explored in Section 4.2.3, Serine protease genes. 

Another specifically vertebrate invention was pulmonary surfactant which 
comprises a series of proteins which serve to reduce surface tension at the 
air-liquid interface in the lung. This was probably a prerequisite for air breathing. 
The human genome contains a number of clustered pulmonary surfactant genes 
(Kolble et al., 1993), among them SFTPA1 (10q21-q24) encoding pulmonary surfactant 
protein A. Orthologues of this gene have been found in air-breathing lungfish 
and surfactant protein has been detected both in the swimbladder of goldfish 
and in the lungs of lungfish (Sullivan et al., 1998). 

Transthyretin is a thyroid hormone-binding protein which in humans is synthesized 
mainly in the liver but also in the choroid plexus and retina. The transthyretin 
gene (TTR; 18q11-q12) is a vertebrate invention having made its first appearance 
in the stem reptiles some 300 Myrs ago (Schreiber and Richardson, 1997). 
Transthyretin is expressed in the choroid plexus in reptiles. Liver synthesis of the 
protein only evolved later (and independently) in birds, eutherian mammals and 
in some marsupials (Schreiber and Richardson, 1997). 

Troponin I, together with troponins C and T, makes up the three-subunit 
troponin complex of the thin filaments of vertebrate striated muscle. Since invertebrates 
and ascidians possess a single troponin I gene (Hastings, 1997), at least 
two duplications must have taken place in the vertebrate lineage to generate 
the three troponin I genes evident in the human genome (TNNI1, 1q31; TNNI2, 
11p15; TNNI3, 19q13). 

The human ABO blood group gene (ABO; 9q34) has a common evolutionary 
origin with the chromosomally linked and now inactive α-1,3-galactosyltransferase 
(GGTA1; 9q33-q34; see Chapter 6, Section 6.2) gene; the two genes 
emerged by duplication and divergence about 400 Myrs ago during the evolution 
of the early vertebrates (Saitou and Yamamoto, 1997). 

Olfactory receptors are extremely important to mammals; dogs, rats and mice 
may have as many as 1000 genes encoding them. Although the human genome may 
contain fewer functional genes than other mammals, pseudogenes abound. Indeed, 
it has been estimated that >0.1% of the human genome is composed of olfactory 
receptor genes and pseudogenes dispersed between at least 13 different chromo- 
somes (Trask et al., 1998). [The classification of this gene family is still in its 
infancy but known members include OR1A1, 17p13; OR1D2, OR1D4, OR1D5, 
17p13; OR1E1, OR1E2, 17p13; OR1F1, 16p13; OR1G1, 17p13; OR2D2, 11p15; 
OR3A1, OR3A2, OR3A3, 17p13; OR5D3, OR5D4, 11q12; OR5F1, 11q12; 
OR6A1, 11p15; OR10A1, 11p15]. The origin and emergence of the present-day 
size of the olfactory receptor gene family probably preceded the divergence of the 
mammals (Ben-Arie et al., 1993; Buettner et al., 1998; Issel-Tarver and Rine, 1997). 
However, at least in the primates, the olfactory receptor genes appear to have still 
been in a considerable state of flux with numerous translocations, duplications and 
deletions having occurred during the evolution of the great apes (Trask et al., 1998; 
see section 4.2.3, Olfactory receptor genes). 


4.1.5 Human genes whose origin preceded the divergence of the 
vertebrates 


One example of a human gene whose origin preceded the advent of the vertebrates 
~500 Myrs ago is the c-myc proto-oncogene (Atchley and Fitch, 1995). Present in 
all vertebrates, a homologue of the human gene (MYC; 8q24) is detectable in 
echinoderms but not in Caenorhabditis or Drosophila (Walker et al., 1992). Other 
examples include the insulin-like growth factor genes (IGF1 and IGF2; 12q22-q24 
and 11p15.5 respectively) which emerged during the evolution of the protochor- 
dates more than 600 Myrs ago (McRory and Sherwood, 1997; see Section 4.2.3). 


4.1.6 Human genes whose origin preceded the divergence of the metazoa 


Human genes with homologues in the fruitfly, Drosophila, must have originated 
prior to the divergence of the deuterostomes from the protostomes ~700 Myrs 
ago (Doolittle et al., 1996; Ayala et al., 1998). A large number of Drosophila genes 
have been shown to have human homologues and vice versa (see FlyBase at 
http://flybase.bio.indiana.edu/). Examples (with human chromosomal locations 
where known) include proto-oncogenes jun (JUN; 1p31-p32; Perkins et al., 1988) 
and fos (FOS; 14q24; Perkins et al., 1988), flightless I (FLII; 17p11.2; Campbell et 
al., 1997), minibrain (MNBH; 21q22.2; Guimera et al., 1996), dorsal (REL; 2p12-p13; 
Steward 1987), atonal (ATOH1; 4q22; Ben-Arie et al., 1996), archain 
(ARCN1; 11q23; Radice et al., 1995), tumor suppressor gene lethal 2 giant larvae 
(LLGL1; 17p11-p12; Strand et al., 1995), transient receptor potential channel-related 
protein 1 (TRPC1; Wes et al., 1995), succinate dehydrogenase, iron-sulfur 
protein subunit (SDH1; 1p22.1-qter; Au and Scheffler 1994), slowpoke (SLO; 
chromosome 10; Pallanck and Ganetzky 1994), enhancer of split (TLE1; 19p13.3; 
Stifani et al., 1992), dodo (PIN1; Maleszka et al., 1996), EGF receptor (EGFR; 
7p12.1-p12.3; Livneh et al., 1985), dishevelled (DVL1; 1p36; Pizzuti et al., 1996), 
son of sevenless (SOS1, 2p21-p22; SOS2, 14q21; Della et al., 1995), awd (NME1; 
17q21.3; Zinyk et al., 1993), Ddx1 DEAD box polypeptide (DDX1; 2p24; Rafti 
et al., 1996), cut (CUTL; 7q22; Neufeld et al., 1992), and the paired box domain 
family of transcription factors that play an important role in development 
(PAX1, 20p11.2; PAX2, 10q25; PAX3, 2q36; PAX4, 7q22-qter; PAX5, 9p13; 
PAX6, 11p13; PAX7, 1p36; PAX8, 2q12-q14; PAX9, 14q12-q14; Quiring et al., 
1994; Balczarek et al., 1997; Czerny et al., 1997). 

Homologs of the ETS1 proto-oncogene, located on human chromosome 
11q23, have been found not only in Drosophila (Laudet et al., 1993) but throughout 
the metazoa in organisms as evolutionarily distant as sponges, anemones, 
flatworms and nematodes (Degnan et al., 1993; Laudet et al., 1999). The origin 
of the ETS genes therefore appears to predate the divergence of the schizocoeles 
(arthropods, annelids etc) from the pseudocoeles (nematodes) more than 750 
Myrs ago (Doolittle et al., 1996). STAT (Signal Transducers and Activators of 
Transcription) proteins represent an example of proteins present in Drosophila 
and Dictyostelium but not in yeast (Kawata et al., 1997) and so must have 
emerged at some stage during the adaptive radiation of the metazoa. Homology 
is also evident between members of the human lipase gene family [lipoprotein 
lipase (LPL; 8p22), hepatic lipase (LIPC; 15q21-q23) and pancreatic lipase 
(PNLIP; 10q26)] and the yolk proteins of Drosophila (Hide et al., 1992; 
Kirchgessner et al., 1989). 

Another example of a gene with an ancestry stretching back at least as far as 
750 Myrs, is a human retina-expressed gene, CAGR1 (13q13) which is homologous 
to the Caenorhabditis elegans cell fate-determining gene, mab-21 (Margolis 
et al., 1996). Ahringer (1997) presented evidence that at least 50% of 
Caenorhabditis genes are likely to have counterparts in the human genome. 
Similarly, the human neurotrophic tyrosine kinase receptor (NTRK3; 15q25) 
gene has a homologue in the snail, Lymnaea stagnalis (van Kesteren et al., 1998) 
although not apparently in C. elegans. 

Ancient conserved regions, deemed to be regions of the greatest structural or func- 
tional importance on account of their evolutionary conservation are also evident 
in various other proteins with homologues in both human and C. elegans, for 
example adenylate cyclases, epidermal growth factor-like domains, gelsolin, inter- 
mediate filament proteins, kinesins, neurotransmitter transporters and the ubiq- 
uitins (Green et al., 1993). 




4.1.7 Human genes whose origin preceded the divergence of animals 
and fungi 


We must acknowledge that man with all his noble qualities still bears in his 
bodily frame the indelible stamp of his lowly origin. 
C. Darwin The Descent of Man (1871) 


The 12 068 kb sequence of the 16 chromosomes of the yeast Saccharomyces cerevisiae 
genome has now been established (Goffeau et al., 1996). A considerable proportion 
of the organism’s 5885 genes are significantly related to sequences in the 
human genome. Tugendreich et al. (1994) used the BLASTX P program to mea- 
sure the extent of this homology: 29% of human cDNAs found in GenBank 
matched a yeast protein with a P value of less than 10°. More specifically, these 
authors showed that some important human disease genes manifest considerable 
sequence similarity to a yeast counterpart e.g. adrenoleukodystrophy (ALD, 
Xq28) with a yeast 70 kD peroxisomal membrane protein, the myotonic dystro- 
phy protein (DMPK, 19q13) with a yeast cAMP-dependent protein kinase, the 
Wilms’ tumor protein (WT1, 11p13) with a yeast zinc finger protein and the Ret 
(RET, 10q11.2) protein underlying multiple endocrine neoplasia type 2A with a 
yeast cell division control protein (Tugendreich et al., 1994). Similarly, human 
genes OM (Xq28) encoding a c-jun-associated transcription factor (Farmer et al., 
1994), SEC13L1 (3p24-p25) encoding a protein putatively involved in vesicle 
biosynthesis from the endoplasmic reticulum during protein transport (Swaroop 
et al., 1994), DNECL (14q32) encoding cytoplasmic dynein (Gibbons, 1995), the 
MCM family (MCM2, 3q21; MCM3, 6p12; MCM4, 8q12-q13; MCM5, 22q13; 
MCM6, 2q21; MCM7, 7q21-q22) of DNA replication proteins (Kearsey and 
Labib, 1998), the PTEN (10q23.3) tumor suppressor gene (Li et al., 1997) and the 
G protein α-subunit gene family (Wilkie et al., 1992; see Section 4.2.1), have yeast 
homologues. 

Since S. cerevisiae diverged from higher eukaryotes some 1000 Myrs ago 
(Doolittle et al., 1996), homologies between human and yeast genes are clearly 
very ancient. Such conservation at the amino acid level is likely to reflect the 
conservation of fairly basic biological functions. Thus, it comes as no surprise to 
find that the yeast genes encoding cytochrome c (Wu et al., 1986), histone 
deacetylase (Leipe and Landsman, 1997), the origin recognition complex (Gavin 
et al., 1995), the recombination protein Rad51 (Brendel et al., 1997; Shinohara et 
al., 1993) and the nucleotide excision repair gene Rad23 (Masutani et al., 1994) 
have highly homologous human counterparts viz. CYC1 (8q24.3), HDAC1 
(1p34), ORC1L (1p32), RECA (15q15.1) and RAD23A and RAD23B (19p13 and 
3p25, respectively). 

Sometimes the homology is regionally localized, for example the GTPase-activating 
protein-related domain of neurofibromin (encoded by the NF1 gene on human 
chromosome 17q11.2) exhibits extensive homology to the S. cerevisiae proteins IRA1 
and IRA2 (Xu et al., 1990). Such ancient conserved regions (ACRs) represent regions of 
the greatest structural or functional importance. Eukaryotic-specific ACRs are also 
evident in various other proteins with homologues in both human and yeast, for 
example hexokinases, β-transducins, protein kinase catalytic domains and the Src 
homology domain (Green et al., 1993). 




4.1.8 Human genes whose origin preceded the divergence of plants and 
animals 


Some human genes have their counterparts in both plants and fungi implying that 
they originated before the divergence of the three kingdoms (Doolittle et al., 1996). 
Members of the myb gene family (MYB; 6q23.3-q24) are found in animals, plants, 
fungi and even slime moulds (Lipsick, 1996; Rosinski and Atchley, 1998). Myb 
proteins function as regulators of cell growth and differentiation by binding to 
DNA and regulating gene expression. Since slime moulds such as Dictyostelium are 
thought to have diverged before the main eukaryotic radiation which gave rise to 
animals, plants and fungi ~1000 Myrs ago (Doolittle et al., 1996), the presence of 
Myb proteins in this organism serves to date the emergence of this ancient regula- 
tor of gene expression. Remarkably, despite this ancient origin, the Myb-related 
proteins of Drosophila and Dictyostelium are still able to recognize and interact with 
the same cognate DNA sequence as their vertebrate counterparts (Lipsick, 1996). 
The High Mobility Group (HMG) proteins are another human gene family represented 
in both plants and fungi (Laudet et al., 1993). Two members of this family, 
HMG1 and HMG2, are present on human chromosomes 13q12 and 4q31 
respectively. Members of the HMG protein superfamily are characterized by the 
possession of one or more HMG boxes, each of which comprises ~80 amino acid 
residues and is capable of interacting with DNA. Both the actins (Hennessey et 
al., 1993) and the CCAAT-specific transcription factor, NF-Y (Li et al., 1992) have 
their counterparts in plants and fungi whilst the annexins (Morgan et al., 1998) 
and the cystatins (Rawlings and Barrett, 1990) are present in plants but not fungi. 


4.1.9 Human genes whose origin preceded the divergence of prokaryotes 
and eukaryotes 


Would it be too bold to imagine that in the great lengths of time, since the 
earth began to exist, perhaps millions of ages before the commencement of the 
history of mankind, would it be too bold to imagine that all the warmblooded 
animals have arisen from one living filament which the Great First Cause 
endued with animality and thus possessing the faculty of continuing to 
improve by its own inherent activity and of delivering down those improve- 
ments by generation to its posterity, world without end. 

Erasmus Darwin Zoonomia (1794) 


Genes that are common to prokaryotes and eukaryotes must have arisen more 
than 2000 Myrs ago before the divergence of the two groups (Doolittle et al., 1996). 
A number of human genes fit into this category. Hexokinase is one of the best 
characterized examples with sequences available from bacteria, yeast, plants and 
vertebrates (three hexokinase genes HK1, HK2, and HK3 exist in humans on 
chromosomes 10q22, 2p12 and 5q35 respectively). A proposed multi-kingdom 
phylogeny of the hexokinase gene is shown in Figure 4.1 and depicts a series of 
duplication and fusion events that must have occurred during its evolution 
(Cardenas et al., 1998). The matrix metalloproteinases that play an important role 
in tissue remodelling and wound healing have homologues in plants, animals and 
bacteria (Massova et al., 1998). 
Universal stress protein UspA of E. coli is homologous to the MADS-box 
transcription regulators of eukaryotes (Mushegian and Koonin, 1996) whilst the 
FUS6 family of eukaryotic proteins contains a putative DNA-binding domain 
related to bacterial helix-turn-helix transcription regulators (Mushegian and 
Koonin, 1996). Other human genes with homologues in E. coli are the Y-box 
family (YB1; 1p34), which encode nucleic acid-binding proteins that serve as 
transcriptional repressors in humans and mediate the cold shock response in bacteria 
(Wolffe et al., 1992; Wolffe, 1994), the SIR2L gene family, which is involved in 
cell cycle progression, chromosome stability and chromatin silencing 
(Brachmann et al., 1995), CYP51 (7q21), the most conserved member of the P450 
monooxygenase family involved in the demethylation of sterol precursors 
(Yoshida et al., 1997), the MLH1 (3p21), MSH2 (2p21-p22), GTBP (2p16), PMS1 
(2q31-q33) and PMS2 (7p22) mismatch repair genes (Fishel et al., 1993; 
Kolodner, 1996), the RAD51A (15q15) DNA repair protein gene homologous to 
the RecA protein of E. coli (Brendel et al., 1997), and HES1 (21q22.3) with 
homology to the σ cross-reacting protein 27A of E. coli (Scott et al., 1997). A considerable 
degree of homology has also been noted between prokaryotic and 
eukaryotic RNA polymerases (Iwabe et al., 1991; Klenk et al., 1992; Sweetser et 
al., 1987), prokaryotic σ factors and eukaryotic transcription factors TBP 
(TFIIB) and TFIIE (Malik et al., 1991; Ohkuma et al., 1991; Rowlands et al., 
1994; Sumimoto et al., 1991), eukaryotic translation initiation factors eIF-1A and 
eIF-5A and their prokaryotic counterparts (Hashimoto and Hasegawa, 1996; 
Kyrpides and Woese, 1998), between prokaryotic and eukaryotic RNA-binding 
proteins (Mulligan et al., 1994) and between prokaryotic and eukaryotic type II 
DNA topoisomerases and DNA polymerases (Forterre et al., 1994). Homology is 
likewise evident between prokaryotic and eukaryotic 
aminoacyl-tRNA synthetases (Francklyn et al. 1997; Nagel and Doolittle, 1991; 
see Section 4.2.1, Genes encoding tRNAs and aminoacyl-tRNA synthetase). 

Figure 4.1. A proposed phylogeny for the hexokinases (Cardenas et al., 1998). 

The eukaryotic cell cytoskeleton contains a number of proteins, the most abun- 
dant being actins, tubulins, myosins and cytokeratins (Little and Seehaus, 1988). 
The evolutionary origin of these proteins is intimately bound up with the origins 
of the eukaryotic cell itself (Doolittle, 1992). Thus γ-tubulin is thought to have 
evolved before the α- and β-tubulins diverged from each other and since these 
three tubulin isotypes are found in all eukaryotes, their origins must have pre- 
ceded the eukaryotic divergence (Ludueña 1998). Consistent with this postulate, 
sequence similarities exist between eukaryotic actins and tubulins and bacterial 
ftsA and ftsZ proteins (Doolittle, 1995). 

There are many instances of genes present in E. coli or yeast having multiple 
homologues in the human genome indicating that the gene family in question must 
have expanded during evolution. One example is provided by the helicase genes of the 
RecQ family of which there are five known in the human genome (WRN, 8p; BLM, 
15; RECQL, 12p12; RECQL4, 8q24.3; RECQL5, 17q25) (Kitao et al., 1998). 

As one travels further back in evolutionary time, the similarity between gene 
sequences tends to decay to an extent that only specific portions may be recog- 
nizably homologous. These ancient conserved regions (ACRs) represent those 
regions of greatest structural or functional importance and often correspond to 
specific domains. ACRs common to both prokaryotes and eukaryotes have been 
noted in a diverse array of proteins, for example enolase, glyceraldehyde 3-phos- 
phate dehydrogenase, cytochrome c oxidase subunit I, aminoacyl-transfer RNA 
(II) synthetases, HSP70 and HSP90 (see section 4.2.3), phosphoglycerate kinase, 
pyruvate dehydrogenase E1α, pyruvate kinase, ribosomal proteins L3 and P0
and triosephosphate isomerase (Green et al., 1993). Ceruloplasmin (CP; 3q23- 
q25) and coagulation factors V (F5; 1q23) and VIII (F8C; Xq28) manifest 
homologies to the small blue proteins of bacteria which have a role in electron 
transfer (Rydén and Hunt, 1993). A 60 amino acid domain found in
cystathionine-β-synthase (CBS; 21q22.3) is also found in a bacterial ABC transporter
protein and in a putative protein found in archaebacteria (Bateman, 1997).

Phylogenetic studies have suggested that the emergence of cytochrome oxidase 
(a key enzyme in aerobic metabolism) (Castresana et al., 1994), the aldehyde 
dehydrogenases (e.g. ALDH1, 9q21; ALDH3, 17; ALDH5, 9; ALDH6, 15q26;
ALDH9, 1; ALDH10, 17p11.2; Yoshida et al., 1998), carbamoyl phosphate
synthetase (a key enzyme of arginine and pyrimidine biosynthesis) (CPS1; 2q35;
Schofield, 1993; Lawson et al., 1996), the protein synthesis elongation factors Tu
(TUFM; 16p11) and G (EEF1G) (Baldauf et al., 1996), tRNA splicing
endonuclease (Trotta et al., 1997) and the glucose-6-phosphate isomerase gene family (GPI;
19q13; Hattori et al., 1995) all predated the divergence of prokaryotes and eukary- 
otes. The RadA protein that catalyzes DNA pairing and strand exchange is pre- 
sent not only in all eukaryotes including human (RECA; 15q15.1) but also has 
homologues in both prokaryotes (RecA) and the archaea (Seitz et al., 1998).

Phylogenetic studies have provided evidence for the extreme antiquity of the
glutamine synthetase gene (Kumada et al., 1993). Glutamine synthetase is an 
essential enzyme of nitrogen metabolism with vital functions in both glutamine 
biosynthesis and ammonia assimilation. It is therefore thought to be indispensable
to all living organisms. Comparison of the glutamine synthetase genes of
extant organisms as diverse as bacteria and vertebrates allowed Kumada et al. 
(1993) to estimate the time of duplication of the ancestral glutamine synthetase 
gene to ~3500 Myrs ago, that is >1000 Myrs before the divergence of the 
prokaryotes and eukaryotes. 


4.1.10 The emergence of genes and gene families has paralleled 
organismal evolution 


Probably all the organic beings which have ever lived on this earth have 
descended from some one primordial form. 
C. Darwin The Origin of Species (1859) 


Eukaryotic genomes have been largely constructed by a continual process of gene 
duplication and divergence utilizing a set of basic gene types (Chapter 3, section 
3.6.4) many of which possess counterparts in the genomes of primitive organisms. 
Much of this gene diversification may have taken place in the ‘Cambrian explo- 
sion’ which is thought to have begun ~570 Myrs ago. 

Genes which have emerged during primate evolution are relatively few in num- 
ber and tend to involve the addition of new members to pre-existing multigene 
families (e.g. genes involved in host defence such as the immunoglobulins; the 
alcohol dehydrogenases whose expansion may have been associated with a move 
to a fruit diet). During mammalian evolution, certain genes emerged which were 
intimately associated with mammalian physiology such as lactation. The emer- 
gence of genes with novel functions (e.g. pulmonary surfactant) and the expansion 
of other gene families (e.g. coagulation factors and olfactory receptors) occurred in 
parallel with the emergence of the first terrestrial vertebrates. By contrast, genes 
whose origin preceded the divergence of animals and fungi, or animals and plants, 
encode proteins involved in fairly basic cellular processes. Finally, the truly pri- 
mordial genes shared by both prokaryotes and eukaryotes encode protein prod- 
ucts which are absolutely required for fundamental cellular processes such as 
DNA repair and replication, transcription and translation, as well as certain very 
basic enzymes of metabolism such as glutamine synthetase. 

Iwabe et al. (1996) performed their own analysis of gene duplication during 
organismal evolution. They concluded that most gene duplications giving rise to 
novel functions predated the divergence of the vertebrate and arthropod lineages. 
However, genes encoding products that perform virtually identical functions but 
which differ in their tissue distribution (tissue-specific isoforms) underwent 
duplications independently in vertebrates and arthropods after divergence of 
their respective lineages. Finally, genes which encode proteins that are localized 
to cell compartments (compartmentalized isoforms) emerged by duplications 
which predated the separation of animals and fungi. Iwabe et al. (1996) concluded 
that there was a good correspondence between molecular evolution at the level of 
the gene, and tissue and organismal evolution.


4.2 Multigene families


4.2.1 Gene families 


Dayhoff et al. (1978) defined a gene family as a group of related genes encoding pro- 
teins differing at fewer than half their amino acid positions. Owing to complex evo- 
lutionary histories, both the size of gene families and the relationship between their 
constituent members are very variable (Henikoff et al., 1997). In principle, the mem- 
bers of multigene families can evolve in three different ways—by concerted evolu- 
tion, divergent evolution or by a birth-and-death process (Ota and Nei, 1994; Figure 
4.2). Concerted evolution may operate through either unequal crossing over or gene 
conversion (Chapter 9, section 9.5) both of which serve to homogenize multigene 
family members. Examples of this process include the rRNA and histone genes 
(Sections 4.2.2, Histone genes and 4.2.2, Ribosomal RNA genes). By contrast, some dis- 
tinct groups of genes within multigene families appear to have been maintained over 
long periods of evolutionary time. An example of this divergent evolution is provided 
by the immunoglobulin VH genes (Section 4.2.4, Immunoglobulin genes), the
diversification of which may have conferred a selective advantage. In other multigene
families, gene duplication serves to create new genes whilst other genes are inactivated by
deleterious mutation or are eliminated by unequal crossing over (‘birth-and-death’).
Once again, the immunoglobulin VH genes provide an example of this process. It can
be seen therefore that some multigene families possess members that serve to illus- 
trate the action of all three of the above processes. No one multigene family can be 
regarded as being archetypal. Each illustrates certain principles and so a variety of 
examples will be discussed. 
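
The Dayhoff et al. (1978) criterion quoted above can be illustrated with a minimal Python sketch. It assumes that the two protein sequences have already been aligned to equal length, and it ignores alignment columns containing a gap; the function names, the gap handling and the toy sequences are assumptions introduced here for illustration and are not taken from Dayhoff et al.

```python
def fraction_differing(seq_a: str, seq_b: str, gap: str = "-") -> float:
    """Fraction of aligned positions at which two protein sequences differ.

    Assumes the sequences are pre-aligned (equal length). Columns in which
    either sequence carries a gap are skipped (an assumption of this sketch).
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    compared = 0
    differing = 0
    for a, b in zip(seq_a, seq_b):
        if a == gap or b == gap:
            continue  # skip gapped columns
        compared += 1
        if a != b:
            differing += 1
    if compared == 0:
        raise ValueError("no comparable positions")
    return differing / compared


def same_family(seq_a: str, seq_b: str, threshold: float = 0.5) -> bool:
    """True if the proteins differ at fewer than half of their positions."""
    return fraction_differing(seq_a, seq_b) < threshold


# Toy sequences (hypothetical, for illustration only)
print(same_family("MKTAYIAKQR", "MKTAYLAKQR"))  # one difference in ten positions
print(same_family("MKTAYIAKQR", "GHWPLSDENC"))  # every position differs
```

A strict "fewer than half" comparison is used, so a pair differing at exactly half of the compared positions falls outside the family under this reading of the definition.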


Figure 4.2. Three different modes of evolution of multigene families (after Ota and Nei,
1994).


Actin genes. A total of six actin genes are found in mammals although a considerably
larger number of pseudogenes may be found (Engel et al., 1981; Erba et al.,
1988; Gunning et al., 1994; Kedes et al., 1985; Ng et al., 1985; Ponte et al., 1993).
The actin genes are highly conserved evolutionarily and encode protein isoforms 
that differ from each other by only a few amino acids, mostly at their amino
termini. Those characterized in human are α-smooth muscle aortic actin (ACTA2;
10q22-q24), α-skeletal actin (ACTA; 1q42.1), α-cardiac actin (ACTC; 15q14),
β-nonmuscle cytoplasmic (ACTB; 7p12-p15), γ-nonmuscle cytoplasmic (ACTG1;
17q25) and γ-smooth muscle enteric (ACTG2; 2p13.1) actins. Comparison of the
structures of these six human actin genes (Figure 4.3) indicates that comparable 
regions are identically interrupted by introns although the sizes of the introns dif- 
fer (Miwa et al., 1991). Comparison of nucleotide and amino acid sequences as well 
as gene structures also allowed Miwa et al. (1991) to propose a possible phyloge- 
netic tree for the actin gene family (Figure 4.4). Since all four muscle actins differ 
from the cytoplasmic actins by substitutions at 19 amino acid residues, the mus- 
cle actins must have shared a common ancestral actin gene. Duplication and 
divergence then led to the emergence of the smooth muscle and striated muscle 
actin genes. Introns were gained or lost at different stages in this process. The 
ancestral smooth muscle actin gene acquired an amino acid substitution at 
residue 89 before being duplicated; the smooth muscle γ-actin subsequently lost
an amino acid residue at position 4.

Figure 4.3. Comparison of gene structures of the six human actin genes (after Miwa et al.,
1991). Triangles indicate intron positions. The numbers below indicate intron sizes in
base-pairs or kilobases (in square brackets). The numbers above the lines indicate the
sizes of the 5’ UTR in exons 1 and 2 and those of the 3’ UTR in the last exon.

Figure 4.4. Hypothetical phylogenetic tree for the human actin gene family (after Miwa et
al., 1991). The vertical scale is non-linear and merely represents relative evolutionary time.


Albumin genes. The human albumin gene family comprises four genes, encoding 
albumin (ALB), α-fetoprotein (AFP), α-albumin/afamin (AFM) and group-specific
component/vitamin D-binding globulin (GC) that are closely linked on
chromosome 4q11-q13 (Nishio et al., 1996). These genes are similar in terms of their
structure and encode homologous proteins (Nishio and Dugaiczyk, 1996). A pro- 
posed phylogeny for this gene family is presented in Figure 4.5. The GC gene 
appears to be the oldest member of the family; it has lost two of its original 15 
exons and with them four of the original 18 disulfide bridges characteristic of the 
putative ancestral protein. The AFP and ALB genes have been evolving at a par- 
ticularly rapid rate (Minghetti et al., 1985). However, the ALB gene still displays a 
high degree of evolutionary conservation in terms of its sequence which is perhaps 
a little surprising on account of its apparent nonessential nature evidenced by the 
absence of overt clinical signs in analbuminemic humans and rats (Ohno, 1981). 

Albumin is synthesized in the adult liver whereas α-fetoprotein, being the fetal
counterpart of albumin, is produced in the fetal liver and yolk sac. The mecha- 
nism of the developmental switch bringing about AFP gene repression and ALB 
gene activation is not yet understood (Nakata et al., 1992). 

The chromosomal region containing the albumin gene cluster has been involved 
in a number of pericentric inversions during the evolution of higher primates. As a 
result of one such inversion, the ALB and AFP genes were translocated to the short 
arm of chimpanzee chromosome 3 (analogous to human chromosome 4) (Magenis et 
al., 1987). Similar inversions in the gorilla and orangutan however left the ALB and 
AFP genes on the long arm of chromosome 3 in these species (Magenis et al., 1989). 


Figure 4.5. Proposed model for the evolution of the human albumin gene family (after
Nishio et al., 1996). Tel: telomere. Cen: centromere. -S-S- : disulfide bridge. The ALB
and GC genes diverged before the emergence of amphibians 400 Myrs ago whilst the
AFP gene emerged after the separation of amphibians and reptiles some 350 Myrs ago.


Apolipoprotein genes. The lipoproteins are the major carriers of cholesterol,
triglycerides and other lipids in human plasma (Breslow, 1985). Their protein
components, the apolipoproteins, are encoded by a multigene family which is
partially dispersed in the human genome:
APOC1, APOC2, APOC4, APOE (19q13.2), APOC3, APOA1, APOA4 (11q23), and
APOA2 (1q21-q23). The genes evolved from a primordial APOC1-like gene, which
is thought to have existed ~680 Myrs ago, via a series of internal and complete gene
duplications (Figure 4.6; Luo et al., 1986; Boguski et al., 1986). Since both apoA1 and
apoE genes have been found in fish, the duplication event from which they both
arose must have occurred before the divergence of tetrapods and teleost fish (Babin 
et al., 1997). The structures of the genes show remarkable similarities: all (with the 
exception of APOA4) possess four exons with introns interrupting the coding 
sequence at very similar locations (Figure 4.7). The major difference between the 
genes is in the length of the last exon which encodes a variable number of lipid-bind- 
ing domains that contain multiple repeats of 22 amino acids each of which represents 
a tandem array of two 11-mers (Li et al., 1988). 

Two of the major known apolipoproteins do not fit easily into the above scheme: 
the 29 exon APOB (2p24) gene encodes a protein which may be distantly related to 
the other apolipoproteins (Li et al., 1988) whilst the APOD (3q26.2-qter) gene 
encodes a protein with a high degree of homology to retinol-binding protein but 
little similarity to the other apolipoproteins (Drayna et al., 1987). 


Complement genes. The vertebrate complement system may be regarded as a 
primitive immune system which lacks the capability to recognize foreign anti- 
gens made possible by the evolution of the MHC complex, the T cell receptors 
and the immunoglobulins. Gene duplication has played a major role in the evo- 
lution of the complement components (Farries and Atkinson, 1991). This is still 
evident from the close linkage exhibited by the human complement genes: 
C1QA, C1QB, C1QG (1p34-p36), C1S, C1R (12p13), B factor (BF), C2, C4A,
C4B within the HLA locus at chromosome 6p21, C6, C7, C9 (5p13), C8A, C8B
(1p32), C5, C8G (9q34), membrane cofactor protein (MCP), H factor (HF1) and
decay accelerating factor (DAF) on chromosome 1q32. Some loci however
appear to be isolated, for example I factor (IF; 4q25) and C3 (19p13). Thus,
some complement components exhibit extensive structural homology even 
though their genes are not linked (e.g. C3, C4, and C5).

Figure 4.6. A hypothetical scheme for the evolution of the apolipoprotein genes (after
Luo et al., 1986). Numerals refer to the number of specified mutational events.

Figure 4.7. Structural organization of the human apolipoprotein genes (after Li et al.,
1988). Exons are denoted by bars, the 5’ and 3’ flanking regions and introns by lines.
Open bars represent the 5’ and 3’ untranslated regions, hatched bars the signal peptide
regions and solid bars the mature peptide regions. The numbers above the exons indicate
their length in base-pairs.


The genes encoding the different components of the complement system 
belong to several evolutionarily unrelated gene families. Phylogenetic analysis of 
the C3, C4 and C5 genes, together with the more distantly related α2-macroglobulin
(A2M; 12p12-p13) gene, supports the view that C5 diverged first with C3 and
C4 subsequently diverging before the separation of the jawed and jawless fishes
(Hughes, 1994). Similar analyses have been performed for the complement C1
components (Dodds and Petry, 1993). 

Exon shuffling has been extremely important in diversifying the structure of 
the complement genes thereby conferring novel functions upon the complement 
proteins. Thus for example, C1r, C1s, and C2 possess serine protease domains, C6,
C7, C8α, C8β, and C9 possess thrombospondin and low density lipoprotein receptor
domains, and C1r, C1s, C6, C7, C8α, C8β, and C9 possess homologies with
epidermal growth factor (Farries and Atkinson, 1991).


Crystallin genes. The α-, β-, and γ-crystallins of the lens are ubiquitous in
vertebrates. These proteins have ancient origins as evidenced by the relationship of the
α-crystallins to small heat shock proteins (Caspers et al., 1995; Lu et al., 1995) and
the relationship of the β- and γ-crystallins to proteins found in the
bacterium Myxococcus and the primitive eukaryote Physarum respectively (Wistow,
1993). In humans, there appear to be two active α-crystallin genes (CRYAA,
21q22.3; CRYAB, 11q22.3-q23.1), six active β-crystallin genes (CRYBB2,
CRYBB3, CRYBA4, CRYBB1, 22q11.2-q12.1; CRYBA2, 2q34-q36; CRYBA1,
17q11.1-q12) and five active γ-crystallin genes (CRYGS, 3; CRYGA, CRYGB,
CRYGC, CRYGD, 2q33-q35).

Vertebrate lens crystallins have undergone dramatic changes during evolution 
including gene duplication, gene inactivation, gene recruitment and changes in 
gene expression (Lu et al., 1996; Lubsen et al., 1988; Wistow, 1993). The vertebrate 
lenses with the highest refractive index are found in fish and in the predominantly 
nocturnal rodents; in these lenses, γ-crystallin predominates (Graw et al., 1993).
Most vertebrates have however reduced the refractive index of their lenses in order
to aid accommodation and focusing at a distance; this process has been
accompanied by the loss of γ-crystallins (for example, replaced by δ-crystallins in birds).
Humans have reduced γ-crystallin levels by partial or complete inactivation of
several members of the γ-crystallin gene family once found in the common ancestor of
primates and rodents (Brakenhoff et al., 1990; den Dunnen et al., 1989; Lubsen et
al., 1988; Meakin et al., 1985, 1987). Thus, of the human γ-crystallin genes, only
CRYGC and CRYGD (2q33-q35) are still fully active. Several γ-crystallin genes
have been inactivated by deletion or point mutation whilst the CRYGA and
CRYGB (2q33-q35) genes have been down-regulated by promoter mutation and
increased mRNA turnover respectively (Brakenhoff et al., 1990).

In addition to the ubiquitous crystallins mentioned above, there is a group of 
taxon-specific crystallins that are restricted to certain evolutionary lineages. 
These crystallins are pre-existing metabolic enzymes which have been recruited 
to a novel lens-specific role, not on account of their enzymatic activity but rather 
for structural reasons in that they are able to modify the refractive index of the 
lens. Thus, in most reptiles and birds, δ2-crystallin is the enzyme
argininosuccinate lyase (Piatigorsky et al., 1988) whilst δ1-crystallin has arisen by duplication
of the δ2-crystallin gene but lacks its enzymatic activity (Mori et al., 1990).
Various other examples of gene recruitment (also termed gene sharing) have been
documented among the crystallins, for example lactate dehydrogenase B
(crystallin ε; crocodiles and some birds), NADPH:quinone oxidoreductase
(crystallin ζ; guinea pigs and camelids) and aldehyde dehydrogenase 1 (crystallin η;
elephant shrews). Recruitment of these gene products to a new lens function has
been achieved by the acquisition of lens expression for these genes through the
emergence of novel promoter elements (Wistow, 1993). In the case of ζ-crystallin,
recruitment appears to have occurred independently in guinea pigs and camels;
different regions of the same nonfunctional intronic sequence have been altered
by single base-pair substitutions and micro-deletions and insertions to perform
a role in lens-specific expression (Gonzalez et al., 1995). By contrast, in birds,
recruitment of the δ1-, δ2- and ε-crystallins appears to have come about through
the modification of pre-existing promoters utilized for nonlens tissue expression
(Hodin and Wistow, 1993).


Collagen genes. The collagens are multi-domain proteins which serve as 
structural molecules of the extracellular matrix. They have a very ancient origin 
as evidenced by their presence in sea urchins, annelids, Drosophila and even jel- 
lyfish and sponges (Exposito and Garrone, 1990; Exposito et al., 1991; reviewed 
by Garrone, 1998). These proteins contain one or more domains with a triple 
helical conformation characterised by a repeating Gly-X-Y amino acid motif. 

A total of 19 types of collagen (I to XIX) have so far been defined in humans; these 
comprise homo- or heteromeric assemblies of specific polypeptide chains (Table 4.2). 
A minimum of 34 genes on 13 different human chromosomes are required to encode 
these chains (Table 4.2). Based on their structures, the collagens may be divided into 
several distinct classes: fibrillar (I, II, III, V, XI), basement membrane (IV), fibril- 
associated with interrupted triple helices (IX, XII, XIV, XIX), filament-producing 
(VI), network forming (VIII, X) and anchoring fibril (VII). The class structure of the 
proteins and the homologies between the members of each class is reflected in the 
evolutionary relationships between the genes that encode them, their structure and 
often their chromosomal location (reviewed by Prockop and Kivirikko, 1995; van 
der Rest and Garrone, 1991; Vuorio and de Crombrugghe, 1990). 

The fibrillar collagens are proteins with a continuous triple helical domain 
containing uninterrupted Gly-X-Y motifs. Their genes are thought to have 
diverged some 800-900 Myrs ago (Runnegar 1985) but still share a very similar 
exon-intron organization comprising 52-54 exons of which 42 have specific 
lengths: 45 bp, 54 bp, 99 bp, 108 bp, and 162 bp representing multiples of the 9 bp 
encoding the Gly-X-Y motif. The latter three lengths are combinations of the first 
two suggesting that they may have been derived from an ancestral 54 bp exon by 
duplication and recombination events (Figure 4.8). These events may have been 
mediated by recombination between introns (Butticé et al., 1990; Chu et al., 1984; 
Yamada et al., 1980). Partial gene duplication events are unlikely to have disrupted 
gene function because multiples of 9 bp would not have altered the reading frame 
of the proteins. The 66 exon COL5A1 and COL11A2 genes appear to have
increased in size through an increase in the number of 54 bp exons (Takahara et
al., 1995; Vuoristo et al., 1995). The COL3A1 and COL5A2 genes are both located
at chromosome 2q31, indicative of a close evolutionary relationship. 
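
The exon-size arithmetic described above can be checked with a short Python sketch. It assumes a deliberately simplified model in which an unequal crossover between two exons conserves their combined length and transfers a whole number of 9 bp (Gly-X-Y) units from one exon to the other; the function names and the model itself are assumptions made for illustration, not a description of the actual recombination mechanism.

```python
from typing import Tuple

GLY_X_Y_BP = 9  # one Gly-X-Y triplet repeat is encoded by 9 bp


def unequal_crossover(exon_a: int, exon_b: int, units_moved: int) -> Tuple[int, int]:
    """Lengths of the two recombinant exons after an unequal crossover.

    Simplified model (assumption): total length is conserved and
    `units_moved` whole Gly-X-Y units (9 bp each) pass from exon_b to exon_a.
    """
    moved_bp = units_moved * GLY_X_Y_BP
    if not 0 < moved_bp < exon_b:
        raise ValueError("crossover must leave both exons with positive length")
    return exon_a + moved_bp, exon_b - moved_bp


def is_gly_x_y_multiple(exon_length: int) -> bool:
    """True if the exon encodes a whole number of Gly-X-Y units.

    Any multiple of 9 bp is also a multiple of 3 bp, so the reading frame
    downstream of such an exon is unchanged.
    """
    return exon_length % GLY_X_Y_BP == 0


# Two ancestral 54 bp exons recombine, then the product recombines with a 54 bp exon
first = unequal_crossover(54, 54, units_moved=5)
second = unequal_crossover(first[0], 54, units_moved=1)
print(first, second)

# The characteristic fibrillar collagen exon lengths are all multiples of 9 bp
print(all(is_gly_x_y_multiple(n) for n in (45, 54, 99, 108, 162)))
```

Under these assumptions the first event yields exons of 99 bp and 9 bp and the second yields exons of 108 bp and 45 bp, and every characteristic exon length remains a multiple of 9 bp, which is why such events would not be expected to alter the reading frame.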

Although the genes encoding the basement membrane collagens possess 46-52 
exons, they differ from the fibrillar collagen genes in that they possess many fewer 
exons of length 45 bp and 54 bp. Further, the exons vary widely in size, do not 
always begin with a Gly codon and often split codons. Their similar exon-intron 
organization testifies however to a common evolutionary origin. Moreover, the six 
type IV genes are arranged in syntenic pairs [COL4A1 and COL4A2 (13q34),
COL4A3 and COL4A4 (2q36-q37), COL4A5 and COL4A6 (Xq22)] with head-to- 
head arrangement (sometimes sharing promoter elements), consistent with a 
model of gene duplication and divergence.

Table 4.2. Types of human collagen and the genes that encode them

Collagen type   Constituent chains   Gene symbol   Chromosomal localization

I               α1(I)                COL1A1        17q21.3-q22
                α2(I)                COL1A2        7q22.1
II              α1(II)               COL2A1        12q13
III             α1(III)              COL3A1        2q31
IV              α1(IV)               COL4A1        13q34
                α2(IV)               COL4A2        13q34
                α3(IV)               COL4A3        2q36-q37
                α4(IV)               COL4A4        2q36-q37
                α5(IV)               COL4A5        Xq22
                α6(IV)               COL4A6        Xq22
V               α1(V)                COL5A1        9q34.2-q34.3
                α2(V)                COL5A2        2q31
VI              α1(VI)               COL6A1        21q22.3
                α2(VI)               COL6A2        21q22.3
                α3(VI)               COL6A3        2q37
VII             α1(VII)              COL7A1        3p21.3
VIII            α1(VIII)             COL8A1        3q12-q13
                α2(VIII)             COL8A2        1p32.3-p34.2
IX              α1(IX)               COL9A1        6q13
                α2(IX)               COL9A2        1p32.2-p33
                α3(IX)               COL9A3        20q13.3
X               α1(X)                COL10A1       6q21-q22.3
XI              α1(XI)               COL11A1       1p21
                α2(XI)               COL11A2       6p21.3
XII             α1(XII)              COL12A1       6q12-q13
XIII            α1(XIII)             COL13A1       10q22
XIV             α1(XIV)              COL14A1       8q23
XV              α1(XV)               COL15A1       9q21-q22
XVI             α1(XVI)              COL16A1       1q34
XVII            α1(XVII)             COL17A1       10q24.3
XVIII           α1(XVIII)            COL18A1       21q22.3
XIX             α1(XIX)              COL19A1       6q12-q14


The fibril-associated collagens with interrupted triple helices are distinct from 
both the fibrillar and basement membrane collagens. The COL9A1, COL9A2, 
COL12A1, and COL19A1 genes are similar in structure, indicative of a close
evolutionary relationship (Kaleduzzaman et al., 1997): all contain some exons of
length 54 bp. The COL9A1, COL12A1, and COL19A1 genes are also closely
linked on chromosome 6q12-q14. 

The genes encoding the filament-producing collagens, COL6A1 and COL6A2,
possess a similar structure and are closely linked in head-to-tail orientation 
(Trikka et al., 1997). Although they contain exons of length 45 bp and 54 bp, most 
are 63 bp in length. The type VII collagen gene, COL7A1, contains 118 exons 
some of which are of 45 bp or 54 bp in length but rather more are of 36 bp, the 
remainder invariably being multiples of 9 bp (Christiano et al., 1994). By contrast 
to the extremely fragmented organization of the other collagen genes, the 
COL10A1 gene is remarkably compact, containing only three exons.

Figure 4.8. A hypothetical scheme for the evolution of the exons in a region of the α2(I)
collagen (COL1A2; 7q22) gene coding for the helical region of the polypeptide (Li, 1997).
The number of base-pairs in each exon is given. The dotted line denotes the position of an
unequal crossover. An unequal crossover between two 54 bp exons could give rise to an
exon of 99 bp and an exon of 9 bp whereas an unequal crossover between an exon of
99 bp and an exon of 54 bp could give rise to an exon of 108 bp and an exon of 45 bp.

The evolution of the collagens is thus broadly consistent with a model of gene
duplication and divergence coupled with intragenic exon duplications, deletions
and fusions. The recurring sizes of many of the exons that encode the extant
human collagens provide ample evidence for the exon amplification and
contraction processes that must have fashioned them. Although the evolution of
several families of related collagen genes has been studied in some detail, the
evolution of the gene family in its entirety has not yet been pieced together. Since
both the fibrillar and nonfibrillar collagens have a very ancient origin, this task is
one of considerable complexity.


Genes for the fibroblast growth factors and their receptors. The fibroblast 
growth factors (FGF) constitute a family of polypeptide growth factors that have 
multiple functions in mitogenesis, angiogenesis and wound healing. The FGF
family of proteins interacts with the membrane-associated FGF receptors
(FGFR), which contain an extracellular portion with either 2 or 3
immunoglobulin-like domains, a transmembrane domain and a cytoplasmic tyrosine
kinase domain. These interactions are important for a variety of developmental
processes such as the formation of the mesoderm during gastrulation, the integra- 
tion of growth, budding and patterning during early post-implantation, and the 
development of various tissues including the skeletal system. 

A total of 14 FGF genes have been identified in the human genome: FGF1 
(5q31), FGF2 (4q25-q27), FGF3 (11q13), FGF4 (11q13), FGF5 (4q21), FGF6
(12p13), FGF7 (15q15-q21), FGF8 (10q24), FGF9 (13q12), FGF10 (5p12-p13),
FGF11 (17q21), FGF12 (3q28), FGF13 (Xq26) and FGF14 (13q34). Thus, with
the exception of the closely linked and tandemly repeated FGF3 and FGF4 
genes, the FGF genes are dispersed in the human genome. FGF genes are pre- 
sent in invertebrate genomes and probably expanded in number after the diver- 
gence of protostomes and deuterostomes. The mammalian FGF gene family has 
emerged by a process of phased duplication and divergence (Figure 4.9; Coulier et 
al., 1997; Johnson and Williams, 1993). Some of the paralogous mammalian FGF
genes may be derived from genome duplications (Chapter 2, section 2.2) that
occurred early on in the evolution of the vertebrate genome, for example FGF4
and FGF6 on chromosomes 11 and 12, respectively, and FGF1 and FGF2 on
chromosomes 5 and 4, respectively.

The generation of FGF diversity may have played a role in the various innova- 
tions of the vertebrate skeletal system and the novel FGF family members are 
likely to have co-evolved with the FGFRs. FGFR genes are also found in inverte- 
brate genomes (Coulier et al., 1997) and four are found in the human genome: 
FGFR1 (8p11), FGFR2 (10q26), FGFR3 (4p16) and FGFR4 (5q35-qter).
However, by contrast with the FGF family expansion, the FGFR gene family 
appears to have undergone only a single phase of expansion (Coulier et al., 1997). 


Figure 4.9. A hypothetical scheme for the evolution of the fibroblast growth factors and
their receptors. The putative phases of gene duplications (on the left) are tentatively
related to a phylogenetic tree of the metazoa (on the right). After Coulier et al. (1997).
Since this phylogenetic study was performed, several further FGF genes have been
characterized in the human genome (see text).


GABA receptor genes. The γ-aminobutyric acid (GABA) receptor is a pentameric
ion channel complex which mediates fast inhibitory synaptic transmission in the
central nervous system. The receptor exists as many different isoforms assembled
from different combinations of subunit subtypes (α, β, γ, δ, ε and ρ). The
human GABA receptor genes are clustered at different chromosomal locations:
GABRA2, GABRA4, GABRB1, GABRG1 (4p12-p13), GABRA1, GABRA6,
GABRB2, GABRG2 (5q34-q35), GABRA5, GABRB3, GABRG3 (15q11-q13),
GABRA3, GABRB4, GABRE (Xq28), GABRR1, GABRR2 (6q14-q21), and
GABRD (1p) (McLean et al., 1995; Russek and Farb, 1994). The organization of this
gene family is consistent with an evolutionary model of intracluster gene duplication
as well as the duplication of a primordial gene cluster. The tandem duplication of the
ancestral α subunit gene appears to have occurred before the duplication and
translocation of at least one of the original gene clusters.
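
The reasoning behind the cluster-duplication model can be made concrete by tabulating the subunit classes present at each chromosomal location. The following Python sketch uses the cluster membership listed in the text; the rule that the fifth letter of each gene symbol identifies the subunit class is a convention assumed here for illustration, and the variable and function names are hypothetical.

```python
from collections import Counter

# Cluster membership as listed in the text
# (McLean et al., 1995; Russek and Farb, 1994)
GABA_CLUSTERS = {
    "4p12-p13": ["GABRA2", "GABRA4", "GABRB1", "GABRG1"],
    "5q34-q35": ["GABRA1", "GABRA6", "GABRB2", "GABRG2"],
    "15q11-q13": ["GABRA5", "GABRB3", "GABRG3"],
    "Xq28": ["GABRA3", "GABRB4", "GABRE"],
    "6q14-q21": ["GABRR1", "GABRR2"],
    "1p": ["GABRD"],
}

# Assumed convention: the fifth letter of the symbol gives the subunit class
SUBUNIT_CLASS = {
    "A": "alpha", "B": "beta", "G": "gamma",
    "D": "delta", "E": "epsilon", "R": "rho",
}


def cluster_composition(genes):
    """Count the subunit classes represented among a list of gene symbols."""
    return Counter(SUBUNIT_CLASS[symbol[4]] for symbol in genes)


composition = {
    locus: cluster_composition(genes) for locus, genes in GABA_CLUSTERS.items()
}
for locus, counts in composition.items():
    print(locus, dict(counts))

# Clusters carrying two alpha subunit genes
two_alpha = [locus for locus, counts in composition.items() if counts["alpha"] == 2]
print(two_alpha)
```

On this tabulation the clusters at 4p12-p13, 5q34-q35 and 15q11-q13 each contain α, β and γ subunit genes, and two of them carry a pair of α genes. That pattern is what one would expect if an α gene duplicated in tandem before the cluster itself was copied, although the tabulation alone cannot establish the order of events.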


G protein α subunit genes. G proteins are ubiquitous in eukaryotes and serve to
mediate signal transduction via appropriate receptors (see Section 4.2.3,
G-protein-coupled receptor genes) by coupling extracellular signals to intracellular
effectors such as adenylyl cyclase, phospholipases and ion channels. More than a
dozen G protein α subunit genes have been identified in the human genome
(Figure 4.10) and their evolutionary history appears to stretch back over 1500
Myrs (Wilkie et al., 1992). Whilst some members of the human gene family are
chromosomally linked [e.g. GNAI2 and GNAT1 (3p21), GNAI3 and GNAT2
(1p13), GNA11 and GNA15 (19p13)], the remainder are scattered around the
genome. This organization is potentially explicable by a combination of successive
genome duplications, tandem duplication, and duplication/translocation events
(Wilkie et al., 1992).


Figure 4.10. Human G protein


Globin genes. Proteins homologous to the globins are ubiquitous in eukaryotes
and even have counterparts among the prokaryotes (Hardison, 1998; Riggs, 1991).
The myoglobin (MB; 22q11.2-q13) gene and the ancestor of the extant globin
genes are thought to have arisen from a common ancestral gene encoding a
heme-containing protein ~700 Myrs ago before the advent of the vertebrates
(Czelusniak et al., 1982; Suzuki and Imai, 1998). The α- and β-globin genes
diverged from each other about 500 Myrs ago and at some stage became
chromosomally separated (Efstratiadis et al., 1980). The α-globin cluster subsequently
evolved by a process of successive duplication and divergence (Figure 4.11): the
ζ/α gene divergence occurred about 400 Myrs ago (Czelusniak et al., 1982;
Proudfoot et al., 1982) and the θ/α gene divergence about 260 Myrs ago (Hsu et al.,
1988). The α1- and α2-globin genes arose from a further duplication event
between 50 Myrs and 60 Myrs ago whilst the ψα1 pseudogene was inactivated
~45 Myrs ago (Proudfoot and Maniatis, 1980; Figure 4.11). The 40 kb human
α-globin gene cluster (16p13.3) therefore comprises the embryonically expressed
ζ (HBZ) gene, the post-natally expressed α2 (HBA2) and α1 (HBA1) genes, the
θ-globin (HBQ1) gene (probably a transcribed pseudogene) and three conventional
pseudogenes (HBZP, HBAP1, HBAP2) (Figure 4.12). The genes in the α-globin
gene cluster are therefore arranged in the order of their activation during
ontogeny. The human HBA1 and HBA2 genes have remained virtually identical
to each other as a consequence of crossing over and gene conversion (Bailey et al.,
1992; Hess et al., 1984; Michelson and Orkin, 1983).

The ancestral β-globin gene is thought to have duplicated about 200 Myrs ago to
yield a δ/β ancestral gene and an ε/γ ancestral gene (Hardies et al., 1984; Figure 4.11).
The δ- and β-globin genes diverged from the δ/β ancestral gene about 40 Myrs ago.


Figure 4.11. Evolution of the vertebrate globin genes (redrawn from Efstratiadis et al.,
1980 and Higgs et al., 1989). Numbers denote approximate estimated times of divergence
in Myrs before present.

Figure 4.12. Organization of the human α- and β-globin clusters. Arrows denote the
direction of transcription. Solid boxes denote genes, empty boxes pseudogenes and the
hatched box the θ-globin (HBQ1) gene which is transcribed but probably not expressed.
The relative positions of the locus control regions (LCRs) are denoted together with the
constituent DNase I hypersensitive sites (HS).

The ε- and γ-globin genes diverged from the ε/γ ancestral gene about 100 Myrs ago
(Figure 4.11) followed by the duplication of the γ-globin gene to form the Gγ and Aγ
genes, an event which occurred before the divergence of the Old World and New
World monkeys (Chiu et al., 1997; Slightom et al., 1987). The human β-globin gene
cluster (11p15.5) therefore comprises the embryonically expressed ε (HBE1) gene,
the fetal Gγ (HBG2) and Aγ (HBG1) genes, the post-natally expressed β (HBB)
and δ (HBD) genes plus a single ψη pseudogene (HBBP1) (Figure 4.12). Thus, as
with the α-globin genes, the genes in the 75 kb β-globin gene cluster are arranged
in the order of their activation during ontogeny.
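
The divergence dates quoted above for the globin genes can be collected into a single
timeline. The following Python sketch simply tabulates the estimates given in the text
(in Myrs before present) and sorts them from the most ancient to the most recent. The
variable names, the event labels and the use of (youngest, oldest) ranges are
illustrative choices made for this example, not part of the cited analyses.

```python
# Approximate dates (Myrs ago) as quoted in the text.
# Each entry: (event label, youngest estimate, oldest estimate).
GLOBIN_EVENTS = [
    ("alpha1/alpha2 duplication", 50, 60),
    ("myoglobin/globin ancestor split", 700, 700),
    ("alpha/beta divergence", 500, 500),
    ("delta/beta divergence", 40, 40),
    ("zeta/alpha divergence", 400, 400),
    ("epsilon/gamma divergence", 100, 100),
    ("theta/alpha divergence", 260, 260),
    ("beta-globin ancestor duplication", 200, 200),
    ("psi-alpha1 inactivation", 45, 45),
]


def timeline(events):
    """Return the events ordered from the most ancient to the most recent."""
    return sorted(events, key=lambda e: e[2], reverse=True)


if __name__ == "__main__":
    for name, young, old in timeline(GLOBIN_EVENTS):
        span = f"~{old}" if young == old else f"{young}-{old}"
        print(f"{span:>6} Myrs ago  {name}")
```

Listing the events in this way makes it easy to see that the α- and β-globin clusters
were each assembled by a nested series of duplications spread over several hundred
million years.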

The HBD gene is present in humans, apes and New World monkeys but has
been inactivated in Old World monkeys (Kimura and Takagi, 1983; Martin et al.,
1980). During evolution, gene conversion has operated on the HBD gene, allowing
it to acquire sequence characteristics of the HBB gene (Koop et al., 1989; Tagle
et al., 1991). Gene conversion events have however been even more frequent
between the HBG2 and HBG1 genes (Chiu et al., 1997; Slightom et al., 1987;
1988). In addition to gene conversion, a number of other events have occurred
during evolution which have altered the β-globin cluster in specific ways in
different mammalian orders; these include insertion of repeat sequences, change in
expression profile, gene duplication, gene fusion and gene loss or inactivation
(Figure 4.13). Thus, the lagomorphs and rodents lack the η-globin locus whilst the
artiodactyls lack the γ-globin locus. As a consequence of some of these changes,
the β-globin cluster in mammals varies from as little as 20 kb in lemurs to about
90 kb in goats (Figure 4.13).

The ε- and γ-globin genes were originally embryonically expressed and fetally
inactive and this early expression pattern is still found in the galago and lemurs
(Tagle et al., 1988). In higher primates, the γ-globin gene was duplicated before the
divergence of Old World and New World monkeys (perhaps by an unequal homologous
crossing over event mediated by LINE elements; Fitch et al., 1991) and
became fetally expressed as a direct result of the accumulation of sequence
changes in the 5’ flanking region (Johnson et al., 1996; TomHon et al., 1997). In
Old World monkeys, apes and humans, both the γ1- and γ2-globin genes are functional
but the expression of the γ1 gene is three-fold higher than that of the γ2
gene. In New World monkeys, only one γ-globin gene is functional, usually γ2
(Chiu et al., 1996; 1997; Johnson et al., 1996; Meireles et al., 1995).


Growth hormone and somatomammotropin genes. The growth hormone
and prolactin genes are thought to have emerged as a result of a duplication
event some 470 Myrs ago. In nonprimates, with the exception of caprine ruminants
(Wallis et al., 1998), GH is encoded by a single gene whilst in primates, the
gene has expanded to a gene cluster. Thus, the human gene encoding
pituitary-expressed growth hormone (GH1) is located on chromosome 17q23 within a
cluster of five related genes (Chen et al., 1989). The other loci present in the
growth hormone gene cluster are two chorionic somatomammotropin genes
(CSH1 and CSH2), a chorionic somatomammotropin pseudogene (CSHP1) and
a second growth hormone gene (GH2). These genes are separated by intergenic
regions of 6 kb to 13 kb in length, lie in the same transcriptional orientation, are
placentally expressed and are under the control of a downstream tissue-specific
enhancer (Jacquemin et al., 1994). The GH2 locus encodes a protein that differs
from the GH1-derived growth hormone at 13 amino acid residues. All five genes
share a very similar structure with five exons interrupted at identical positions
by short introns (Hirt et al., 1987).

On the basis of sequence data, it was initially thought that the evolution of the 
GH gene cluster had taken place comparatively recently over the last 15 Myrs by a 
process of gene duplication and divergence (Chen et al., 1989; Miller and 
Eberhardt, 1983). The first event is thought to have been the duplication of a sin- 
gle ancestral GH gene to generate pre-GH and pre-CSH genes (Barsh et al., 1983) 
followed by the duplication of the newly created gene pair to yield GH1, CSH1, 
GH2, and CSH2 (Figure 4.14). Finally, the CSH1 gene was duplicated to form two 
genes, one of which (CSHP1) became inactivated through mutation. Although 
this sequence of events is probably correct, the estimated timings now appear to be 
seriously inaccurate because a GH2 gene has subsequently been reported in 
macaque (Golos et al., 1993). The duplication event creating the GH1 and GH2
genes must therefore have occurred before the divergence of Old World monkeys 
and the anthropoid apes ~25 Myrs ago. No study has been performed on either 
prosimians or New World monkeys and so the timing of this duplication event 
may have to be revised still further as new data emerge. The initially misleading 
conclusion as to the timing of the duplication events was probably due to a failure 
to consider the effect of gene conversion in minimizing sequence differences 
between the different GH loci (Giordano et al., 1997). 
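
The way in which gene conversion can distort such dating may be illustrated with a
simple molecular-clock calculation. Under a constant substitution rate r per site per
year, two duplicated genes that differ at a fraction K of sites would be dated to
T = K/(2r). The Python sketch below is an illustration only: the rate and the two
divergence values are hypothetical round numbers chosen for the example, not data
from the studies cited, and the function name is invented here.

```python
def clock_age_myrs(k_subs_per_site, rate_per_site_per_year):
    """Molecular-clock age of a duplication in Myrs: T = K / (2r)."""
    years = k_subs_per_site / (2.0 * rate_per_site_per_year)
    return years / 1e6


# Hypothetical values, for illustration only.
RATE = 1.0e-9        # substitutions per site per year (assumed)
K_EXPECTED = 0.05    # divergence accumulated in the absence of gene conversion (assumed)
K_OBSERVED = 0.03    # divergence left after conversion has erased some differences (assumed)

true_age = clock_age_myrs(K_EXPECTED, RATE)
apparent_age = clock_age_myrs(K_OBSERVED, RATE)

if __name__ == "__main__":
    print(f"age implied without conversion: {true_age:.0f} Myrs")
    print(f"apparent age after conversion:  {apparent_age:.0f} Myrs")
```

Because conversion homogenizes the paralogous sequences, the observed divergence,
and hence the apparent age, is best regarded as a lower bound on the true age of the
duplication.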

The 70 kb human GH gene cluster contains some 48 Alu sequences (Chen et al., 
1989) some of which may have mediated the unequal recombination events 
responsible for the gene duplications (Barsh et al., 1983). On the other hand, some 
Alu sequences have become duplicated along with their associated genes during 
the duplication process. One consequence of the relatively recent evolutionary 
changes in the GH gene cluster is that multiple sequence homologies and internal 
repetitions are still evident within it. 


Glycophorin genes. The human glycophorins are encoded by a multigene fam- 
ily that has evolved from an ancestral gene that ceased to be functional at some 
stage during primate evolution. Glycophorins A and B are the major sialoglyco- 
proteins of the human erythrocyte membrane and carry the MN and Ss blood 
group antigens respectively. They are encoded by the GYPA and GYPB genes 
which occur in a 330 kb cluster together with the glycophorin E (GYPE) gene on 
chromosome 4q28-q31. This cluster is thought to have arisen by two successive 
duplications, the first creating the GYPA gene by duplication of an ancestral gene 
(between 9 Myrs and 35 Myrs ago) and the second (5-21 Myrs ago) generating the 
GYPB and GYPE genes (Kudo and Fukuda, 1989; Onda and Fukuda, 1995; 
Figure 4.15). The GYPB gene differs from the GYPA gene by virtue of the presence
of (i) a G>T transversion at the +1 position of the intron 3 donor splice site
which serves to inactivate the expression of exon III and (ii) a 9 bp insertion at the
5’ end of exon V (Blumenfeld et al., 1997). The GYPE gene differs from the
GYPB gene in that exon IV has been inactivated by a splice site mutation, exon V
contains a 24 bp insertion and the encoded protein has been shortened by 5 amino
acids through the introduction of a premature Stop codon (Blumenfeld et al.,
1997). Both the GYPB and GYPE genes lack exons 6 and 7 present in the GYPA
gene that encode the cytoplasmic domain.

Figure 4.14. Evolution of the human growth hormone gene cluster by serial unequal
crossing over events (after Phillips, 1995). (a) Duplication to yield pre-growth hormone
(GH) and pre-chorionic somatomammotropin (CSH) genes. (b) Unequal crossing over
between pre-GH and pre-CSH genes to yield (c) the GH1, GH2, and CSH2 genes plus a
CSHP1 locus which was to become inactivated. (d) Subsequently, the CSH1 gene emerged
by duplication of the CSHP1 gene.

The ancestral gene from which the GYPA gene was derived is thought to have 
been inactivated but its remnants may still be present on chromosome 4, just 
downstream of the GYPA gene (Onda et al., 1993, 1994). The ancestor of the human
GYPB and GYPE genes underwent a rearrangement in its 3’ region by recombi- 
nation between Alu sequences and in so doing, acquired 3’ sequence from another 
gene. This unequal crossing over event served to generate short last exons in both 
GYPB and GYPE genes; that of GYPB comprises a single amino acid whilst that 
of GYPE encodes an untranslated region. 

At least one GYPA-like gene occurs in all hominoid primates whereas one 
GYPB-like gene is present in man, both species of chimpanzee and the gorilla but
not orangutan and gibbon (Rearden et al., 1993). The GYPE gene is present in all 
hominoid primates that possess a GYPB gene, but is polymorphically 
present/absent in gorillas (Rearden et al., 1993). Thus duplication of the 
GYPB/GYPE progenitor sequence to yield the GYPB and GYPE genes must have 
occurred prior to the divergence of the human-gorilla-chimpanzee clade ~10 Myrs 
ago. Subsequently, the GYPE gene has acquired DNA sequence from the GYPA 
gene by gene conversion (Kudo and Fukuda, 1994; Rearden et al., 1993). 
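
The two lines of evidence on the timing of the GYPB/GYPE duplication can be
combined by intersecting the intervals they imply. The Python sketch below takes the
sequence-based estimate (5-21 Myrs ago) and the constraint from species distribution
(before ~10 Myrs ago) from the text. Treating the latter as an interval with an open
upper bound, and the helper function itself, are assumptions made for illustration.

```python
def intersect(a, b):
    """Intersection of two closed intervals (lo, hi), or None if they are disjoint."""
    lo, hi = max(a[0], b[0]), min(a[1], b[1])
    return (lo, hi) if lo <= hi else None


# Sequence-based estimate for the GYPB/GYPE duplication (Myrs ago), from the text.
SEQUENCE_ESTIMATE = (5, 21)

# Species distribution: the duplication predates the human-gorilla-chimpanzee
# divergence (~10 Myrs ago). The open upper bound is an assumption of this sketch.
PHYLOGENETIC_BOUND = (10, float("inf"))

combined = intersect(SEQUENCE_ESTIMATE, PHYLOGENETIC_BOUND)

if __name__ == "__main__":
    print(f"duplication most plausibly occurred {combined[0]}-{combined[1]} Myrs ago")
```

On these figures, the younger part of the sequence-based range (5-10 Myrs ago) is
excluded by the presence of both genes in gorilla and chimpanzee.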


Homeobox genes. Homeobox (HOX) genes represent a major class of transcription 
factors whose origin preceded the radiation of the triploblastic metazoa (Finnerty
and Martindale, 1998) and which are directly involved in the control of embryogenesis
in animals as diverse as nematodes, insects and vertebrates. The expression of
these genes is spatially and temporally regulated during embryonic development and 
plays an important role in establishing the system of positioning along the anterior- 
posterior axis of the embryo. HOX proteins share a common homeobox domain and 
influence gene transcription by binding to a 4 bp core sequence in the promoters of 
their target genes. 

The HOX genes appear to have been duplicated on several occasions during evo- 
lution, perhaps even before the origin of angiosperms, fungi and the metazoa 
(Bharathan et al., 1997). Early in vertebrate evolution, an ancestral HOX gene clus- 
ter present in invertebrates and the cephalochordate Amphioxus was duplicated 
twice probably by whole genome duplication (see Chapter 2, section 2.1; Bailey et 
al., 1997; Garcia-Fernàndez and Holland, 1994; Holland and Garcia-Fernàndez, 
1996; Schughart et al., 1989) to give rise to the four linkage groups (HOX A, HOX 
B, HOX C, HOX D). From studies of HOX gene number, this duplication event is 
likely to have occurred after the divergence of ray-finned and lobe-finned fishes 
but before the radiation of the teleosts (Amores et al., 1998). A scheme for the evo- 
lution of mammalian HOX genes is presented in Figure 4.16. It can be seen that 
cluster duplication must have been followed by the loss of some specific HOX 
genes and indeed this is evident from comparison of HOX gene number between 
clusters (Figure 4.17). The physical order of genes within the clusters has however 
been conserved during vertebrate evolution (Schughart et al., 1989) perhaps as a 
result of enhancer sharing between different HOX genes (Mann, 1997). 

In human, most of the HOX genes are located in four clusters containing
between them 39 HOX genes (Acampora et al., 1989). The HOX A cluster is
located at 7p14-p15 and contains 11 genes: HOXA1, HOXA2, HOXA3, HOXA4,
HOXA5, HOXA6, HOXA7, HOXA9, HOXA10, HOXA11 and HOXA13. The
HOX B cluster is located at 17q21-q22 and contains 10 genes: HOXB1, HOXB2,
HOXB3, HOXB4, HOXB5, HOXB6, HOXB7, HOXB8, HOXB9 and HOXB13.
The HOX C cluster is located at 12q13 and contains 9 genes: HOXC4, HOXC5,
HOXC6, HOXC8, HOXC9, HOXC10, HOXC11, HOXC12, and HOXC13. The
HOX D cluster is located at 2q31-q32 and contains 9 genes: HOXD1, HOXD3,
HOXD4, HOXD8, HOXD9, HOXD10, HOXD11, HOXD12, and HOXD13. In
addition to the HOX gene clusters, other dispersed paralogous HOX genes are
found in the human genome (e.g. MSX1, 4p16; HMX2, 10q25-q26) and may also
be of functional importance (Brooke et al., 1998).

Figure 4.16. Model for the evolution of the homeobox gene clusters of eukaryotes
(redrawn from Holland and Garcia-Fernandez, 1996). In Drosophila, the eight Hox genes
are found in a single complex which is split between two clusters. AbdB: Abdominal-B.
AbdA: Abdominal-A. Ubx: Ultrabithorax. Antp: Antennapedia. Scr: Sex combs reduced.
Dfd: Deformed. Pb: Proboscipedia. Lab: Labial. In mammals, four Hox gene clusters are
found on separate chromosomes.

Figure 4.17. Human HOX gene clusters. Physical distances are not to scale.
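
The cluster memberships listed above for the human HOX genes can be tallied to
check the total of 39 genes and to show which positions have been lost from each
cluster. The Python sketch below encodes the gene numbers exactly as given in the
text. Treating the numbers 1-13 as 13 paralogy groups, and the names used in the
code, are assumptions made for the purpose of illustration.

```python
# Gene numbers present in each human HOX cluster, as listed in the text.
HOX_CLUSTERS = {
    "A": [1, 2, 3, 4, 5, 6, 7, 9, 10, 11, 13],
    "B": [1, 2, 3, 4, 5, 6, 7, 8, 9, 13],
    "C": [4, 5, 6, 8, 9, 10, 11, 12, 13],
    "D": [1, 3, 4, 8, 9, 10, 11, 12, 13],
}

# Assumption: gene numbers 1-13 correspond to 13 paralogy groups.
PARALOGY_GROUPS = range(1, 14)


def missing_groups(members):
    """Paralogy groups with no representative in a given cluster."""
    return [g for g in PARALOGY_GROUPS if g not in members]


counts = {cluster: len(members) for cluster, members in HOX_CLUSTERS.items()}
total = sum(counts.values())
shared_by_all = [
    g for g in PARALOGY_GROUPS
    if all(g in members for members in HOX_CLUSTERS.values())
]

if __name__ == "__main__":
    for cluster, members in HOX_CLUSTERS.items():
        print(f"HOX {cluster}: {counts[cluster]} genes, missing {missing_groups(members)}")
    print(f"total: {total}; groups present in all four clusters: {shared_by_all}")
```

The gaps differ from cluster to cluster, which is the pattern expected if cluster
duplication was followed by independent gene loss.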

The development of the vertebrate body plan both by elaboration of primitive 
chordate characters and the development of novel morphological characters
(e.g. cranial ganglia, teeth and bone) must have required substantial reprogramming
of gene networks (Shubin et al., 1997). There are several different ways in
which changes in the HOX genes have probably influenced morphological evo- 
lution: (i) expansion in the structural diversity of HOX genes within a given 
complex or class (Sharkey et al., 1997), (ii) expansion of the number of HOX 
complexes (Mayer et al., 1998), (iii) the loss of specific HOX genes (Aparicio et 
al., 1997), (iv) changes in the location, timing or level of HOX gene expression 
(Burke, 1995; Gellon and McGinnis, 1998), and (v) changes in the interactions 
between HOX proteins and their target genes, possibly as a result of mutational 
changes in cis-acting regulatory elements (Carroll, 1995). 


Integrin genes. The integrins are a family of heterodimeric membrane glycoproteins
that participate in cell adhesion and are involved in a wide range of
cell-cell and cell-matrix interactions. The diversity and specificity of integrin
function is paralleled by the structural diversity potentiated by the existence of
at least 16 different alpha chains and 8 different beta chains. Although, in principle,
the alpha and beta chains could associate in a multitude of ways, in practice
diversity is limited to certain combinations. These chains are encoded by a
family of genes which are widely dispersed in the human genome (Table 4.3).
The alpha chain integrins can themselves be divided into two subgroups by
virtue of the insertion of an I domain of about 180 amino acids in the extracellular
region. The I-integrin alpha chain genes (ITGA1, ITGA2, ITGAD,
ITGAM, ITGAL, and ITGAE) are thought to have arisen as a result of an early
insertion into a non-I gene followed by gene duplication and divergence. The
clustering of these genes on chromosomes 5 and 16 (Table 4.3) is thought to have
resulted from relatively recent gene duplications (Wang et al., 1995). The non-I
alpha chain genes (ITGA2B, ITGA3, ITGA4, ITGA5, ITGA6, ITGA7, ITGA8,
ITGA9) are largely confined to clusters on chromosomes 2, 12, and 17 (Wang et
al., 1995; Table 4.3), locations which coincide closely with the homeobox (HOX)
gene clusters (see Chapter 2, section 2.1 and Chapter 4, section 4.2.1, G protein α
subunit genes). This is suggestive of the occurrence of genomic or chromosomal
duplications involving both types of gene cluster. Some of the beta chain
(ITGB) genes are also located on chromosomes 2, 12, and 17 (Table 4.3), indicating
common ancestry with the non-I alpha chain genes.
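
The coincidence between the locations of the non-I alpha chain genes and those of
the HOX clusters can be checked directly from the chromosomal assignments. The
Python sketch below uses the locations given in Table 4.3 and the HOX cluster
locations given in the section on homeobox genes. It assumes that the first symbol in
the non-I list is ITGA2B (as in Table 4.3), and the helper names are hypothetical.

```python
import re

# Chromosomes carrying the four human HOX clusters (7p14-p15, 17q21-q22, 12q13, 2q31-q32).
HOX_CHROMOSOMES = {"A": "7", "B": "17", "C": "12", "D": "2"}

# Non-I alpha chain genes with locations from Table 4.3 (None = not yet assigned).
NON_I_ALPHA = {
    "ITGA2B": "17q21.3",
    "ITGA3": "17",
    "ITGA4": "2q31-q32",
    "ITGA5": "12q11-q13",
    "ITGA6": "2",
    "ITGA7": "12q13",
    "ITGA8": None,
    "ITGA9": "3p21.3",
}


def chromosome(location):
    """Extract the chromosome from a cytogenetic location such as '12q11-q13'."""
    if location is None:
        return None
    match = re.match(r"(\d+|X|Y)", location)
    return match.group(1) if match else None


def on_hox_chromosome(genes):
    """Genes whose chromosome also carries a HOX cluster, sorted by symbol."""
    hox = set(HOX_CHROMOSOMES.values())
    return sorted(g for g, loc in genes.items() if chromosome(loc) in hox)


if __name__ == "__main__":
    print(on_hox_chromosome(NON_I_ALPHA))
```

Sharing a chromosome is of course a much weaker criterion than close physical
linkage, so a check of this kind can only indicate where a more detailed comparison
of map positions is worthwhile.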


Keratin genes. The keratins, cytoskeletal proteins of the epithelium, belong to 
two families: type I (acidic) and type II (basic) (Fuchs et al., 1982). Since keratins 
are obligate heteropolymers, keratin intermediate filaments are composed of one 
type I and one type II polypeptide. This has led to consistent co-expression of 
type I-type II keratin pairs in different types of epithelial cells. The keratins are 
evolutionarily related to the family of intermediate filament proteins which 
includes vimentin and desmin (Hanukoglu and Fuchs, 1982). Type II keratins 
are no more homologous to the type I keratins than they are to other intermediate
filament proteins (Klinge et al., 1987) indicating that the type I and type II keratins
diverged from the common ancestor of all intermediate filaments at about the 
same time. This common ancestral gene probably had its origins among the 
lower eukaryotes (Fuchs and Marchuk, 1983; Krieg et al., 1985). By contrast to
the conserved central alpha-helical domains, the variable terminal glycine-rich
domains of keratins 1 and 10 (comprising short 4-10 amino acid segments
repeated 3-15 times) appear to have evolved by a series of tandem duplications
brought about by unequal crossing over (Klinge et al., 1987).

Table 4.3. Types of human integrin chain and the genes that encode them

Integrin chain    Gene symbol    Chromosomal location
α1                ITGA1          5
α2                ITGA2          5q23-q31
α3                ITGA3          17
α4                ITGA4          2q31-q32
α5                ITGA5          12q11-q13
α6                ITGA6          2
α7                ITGA7          12q13
α8                ITGA8          ?
α9                ITGA9          3p21.3
αIIb              ITGA2B         17q21.3
αE                ITGAE          ?
αL                ITGAL          16p11.2
αM                ITGAM          16p11.2
αV                ITGAV          2q31-q32
αX                ITGAX          16p11-p13
αD                ITGAD          16p11.2

β1                ITGB1          10p11.2
β2                ITGB2          21q22.3
β3                ITGB3          17q21.3
β4                ITGB4          17q11-qter
β5                ITGB5          ?
β6                ITGB6          2
β7                ITGB7          12q13.1
β8                ITGB8          7p

In the human genome, the genes encoding these keratin families have been 
found to be tightly clustered; at 17q12-q21 for the acidic keratins (e.g. KRT9, 
KRT10, KRT14, KRT15, KRT16, KRT17, and KRT19), and chromosome 12 for 
the basic keratins (e.g. KRT1, KRT2A, KRT5, KRT6A, KRT6B, and KRT8) 
(Milisavljevic et al., 1996; Table 4.4). There is one exception to this rule: KRT18 is 
a type I keratin gene but is located on chromosome 12q11-q13. There is some evi- 
dence for gene conversion between type I and type II keratin genes which could 
have facilitated their coevolution (Klinge et al., 1987). Concerted gene duplica- 
tions also provide evidence for coevolution of type I and type II keratin genes 
(Blumenberg, 1988). Thus, in both families, the genes expressed in the embryo 
duplicated and diverged first, followed by the genes expressed in various differ- 
entiated cells. Further gene duplications gave rise to the hair keratin genes 
(Blumenberg, 1988; Powell et al., 1992; Rogers et al., 1998; Table 4.4). This para- 
llelism of gene duplication cannot be explained by gene conversion but as yet the 
underlying mechanism which apparently allows duplications in one family to 
influence duplications in the other is still unclear. Coevolution may have been 
driven by the obligate heteropolymer status of the proteins. The tight regulation 
of the coordinate expression of the type I and type II keratin genes implies that 
unbalanced production is likely to be deleterious. It may therefore follow that the
duplication and functional divergence of a type I keratin gene might lead to a specific
type II gene duplication being selectively favored.


Genes of the major histocompatibility complex. The major histocompatibility 
complex (MHC) region on chromosome 6p21.3 spans 4 Mb DNA and contains the 
genes encoding the human leukocyte antigens (HLA), a large family of cell surface 
glycoproteins (Figure 4.18). The function of the MHC proteins, members of the 
immunoglobulin superfamily (Section 4.2.4, Immunoglobulin genes), is to present 
peptides to T cells. The MHC family is divided into classes I and II whose members 
differ in terms of their structure, function, expression and polymorphism (Hughes, 
1996). Class I molecules (the classical transplantation antigens) are expressed on 
most cells whereas the expression of class II molecules is confined to antigen-presenting
cells. Both class I and II molecules are heterodimers with four extracellular
domains: α1, α2 and α3 associated with β2-microglobulin (class I) and α1, α2, β1
and β2 (class II). These molecules bind both self- and pathogen-derived peptides in
a surface groove (peptide-binding region) and present these to cytotoxic T cells that
lyse the pathogen-infected cell, thereby limiting host infection.


Table 4.4. Types of human keratin and the genes that encode them

Keratin type               Gene        Chromosomal location
Keratin 1                  KRT1        12q11-q13
Keratin 2A                 KRT2A       12q11-q13
Keratin 3                  KRT3        12q12-q13
Keratin 4                  KRT4        12p12-q11
Keratin 5                  KRT5        12q
Keratin 6A                 KRT6A       12q12-q21
Keratin 6B                 KRT6B       12
Keratin 7                  KRT7        12q12-q21
Keratin 8                  KRT8        12
Keratin 9                  KRT9        17q21
Keratin 10                 KRT10       17q21-q23
Keratin 12                 KRT12       17q11-q12
Keratin 13                 KRT13       17q21-q23
Keratin 14                 KRT14       17q12-q21
Keratin 15                 KRT15       17q21-q23
Keratin 16                 KRT16       17q12-q21
Keratin 17                 KRT17       17q12-q21
Keratin 18                 KRT18       12q11-q13
Keratin 19                 KRT19       17q21-q23
Keratin, hair, acidic 1    KRTHA1      17q12-q21
Keratin, hair, acidic 2    KRTHA2      17q12-q21
Keratin, hair, acidic 3A   KRTHA3A     17q12-q21
Keratin, hair, acidic 3B   KRTHA3B     17q12-q21
Keratin, hair, acidic 4    KRTHA4      ?
Keratin, hair, acidic 5    KRTHA5      17q12-q21
Keratin, hair, basic 1     KRTHB1      12q13
Keratin, hair, basic 2     KRTHB2      ?
Keratin, hair, basic 3     KRTHB3      12q12-q13
Keratin, hair, basic 4     KRTHB4      ?
Keratin, hair, basic 5     KRTHB5      12q12-q13
Keratin, hair, basic 6     KRTHB6      12q13

Figure 4.18. Map of the human histocompatibility complex loci on 6p21.3.


In humans, the class I loci are HLA-A, HLA-B, and HLA-C (classical or type Ia)
and HLA-E, HLA-F and HLA-G (non-classical or type Ib) (Figure 4.18). The class
I genes are interspersed with five MHC class I-related (MIC) loci (MICA, MICB,
MICC, MICD, and MICE), five full-length pseudogenes (HLA-H, HLA-J, HLA-K,
HLA-L, and HLA-X) plus several truncated pseudogenes and remnants (Figure
4.19). The five MIC genes are distantly related to the hemochromatosis (HFE;
6p21.3) gene. The class II loci are clustered into three regions, HLA-DR (HLA-DRA,
HLA-DRB1, HLA-DRB2, HLA-DRB3, HLA-DRB4, and HLA-DRB5;
NB. HLA-DRB3, HLA-DRB4, and HLA-DRB5 may be variably present or
absent depending upon the haplotype), HLA-DQ (HLA-DQA1, HLA-DQA2,
HLA-DQB1, HLA-DQB2, HLA-DQB3), and HLA-DP (HLA-DPA1, HLA-DPB1,
HLA-DNA) (Figure 4.18). The class I and II regions are separated by a
gene-dense 1100 kb region containing a number of so-called type III genes
including BF, C2, C4A, C4B, and TNF.

The ancestral MHC molecule appears to have been assembled by combining its 
three constituent domains (peptide-binding domain, immunoglobulin-like 
domain and membrane-anchoring domain) by exon shuffling (Figure 4.20). 
Hughes and Nei (1993) estimated that the class II A and B genes diverged more 
than 500 Myrs ago. Class I genes then emerged by duplication and divergence 
during the primate radiation of the last 60 Myrs (Hughes and Yeager, 1997; 
Kulski et al., 1997). The putative ancestral duplication(s) early in vertebrate evo- 
lution (Chapter 2, section 2.1) may well have provided an impetus to the emer- 
gence of the HLA complex in that the consequent redundancy could have created 
opportunities for the emergence of a variety of accessory and effector molecules 
(Kasahara et al., 1997). A scheme for the subsequent evolution of these genes is 
presented in Figure 4.21.

Figure 4.19. Map of the class I region of the human histocompatibility complex (redrawn
from Wells and Parham, 1996).

Figure 4.20. Postulated origin of MHC class I and class II genes (redrawn from Klein
and O’hUigin, 1993). An exon encoding a soluble immunoglobulin-like domain (ILD) is
joined with an exon encoding a membrane-anchoring domain (MAD) to produce an
ancestral immunoglobulin-family gene. This is joined by an exon encoding a
peptide-binding domain (PBD) to produce an ancestral class II-like gene, assumed to be of
the A variety (coding for the α-chain). Duplication and deletion events, along with sequence
divergence, would produce the present day array of class I A and class II A and B genes.
The class I B gene is in fact the β2-microglobulin (B2M; 15q21-q22) gene, which is not
linked to the main MHC complex.

So far, the MHC appears to be confined to vertebrates (Klein et al., 1993; Klein
and Sato, 1998). In a phylogenetic analysis of class I genes from different mammalian
orders, Hughes and Nei (1989b) found that the class Ib genes clustered
with the class Ia genes. This is explicable either by postulating an independent 
origin for the class Ib genes in different orders of mammals (Hughes, 1991) or by 
invoking the homogenizing influence of gene conversion between class I loci 
within each mammalian order (Rada et al., 1990). Class I genes evolved by a 
process of repeated gene duplication followed by a reduction in the level of 
expression of some genes, and the transcriptional silencing or deletion of others 
(Watkins, 1995). This explains why orthologous relationships are found among 
class I loci of mammals of the same order but not among mammals of different 
orders. By contrast to the class I loci, orthologous relationships between mam- 
malian class II loci are the rule rather than the exception, a consequence of the 
early origin of the class II MHC loci prior to the mammalian radiation (Hughes 
and Nei, 1990). 

The MHC class I genes are conserved in the great apes and Old World monkeys.
Thus orthologues of HLA-A, HLA-B, HLA-E, HLA-F, and HLA-G are present in
macaques although the HLA-C locus has been found only in chimpanzees and
gorillas (Watkins, 1995; Lienert and Parham, 1996). The MHC class I genes of New
World monkeys are similar to the HLA-G and HLA-F genes (Watkins, 1995). The
subfamily Callitrichinae (tamarins and marmosets) manifests an unusually high
rate of turnover of class I MHC loci with different sets of nonorthologous MHC
class I genes being expressed (Cadavid et al., 1997). It would thus appear that class
I genes have been differentially duplicated, deleted and amplified in different
primate lineages (Klein et al., 1998).

Figure 4.21. Evolution of the class I MHC genes by serial duplication (after Klein et al.,
1998). Capital letters denote HLA loci, small letters MIC loci. Each rectangle represents
one locus. Multiple letters within a rectangle denote ancestors of genes specified by
individual letters. Short arrows indicate transcriptional orientation of loci whilst long
arrows denote transpositions of loci.

There is an extremely high level of polymorphism at certain MHC loci, some 20 
times higher than the average level of nucleotide polymorphism found in the 
human genome. New MHC sequence variants have arisen by single base-pair substitution,
reciprocal recombination or gene conversion (Ohta, 1997; Watkins et al.,
1991; Yeager and Hughes, 1999). Gene conversion at the MHC locus can occur at
relatively high frequency (10⁻⁴ per locus per generation; Zangenberg et al., 1995)
but, on its own, it is insufficient to account for the level of polymorphism observed. 
The polymorphisms have gradually accumulated during human evolution and 
have probably been maintained at high frequency by a variety of means, one of the 
most important being overdominant selection in which the heterozygote responds 
better to challenge than either homozygote (Li, 1997; Wells and Parham, 1996). 
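
How overdominant selection maintains a polymorphism can be shown with the standard
one-locus, two-allele model. If the two homozygotes have fitnesses 1 - s and 1 - t
relative to a heterozygote fitness of 1, the frequency of the first allele converges
on a stable equilibrium at t/(s + t) whatever its starting value. The Python sketch
below iterates this model. The selection coefficients are hypothetical values chosen
for illustration and are not estimates for any MHC locus.

```python
def next_generation(p, s, t):
    """Allele frequency after one generation of selection.

    Genotype fitnesses: A1A1 = 1 - s, A1A2 = 1, A2A2 = 1 - t.
    p is the frequency of allele A1 before selection.
    """
    q = 1.0 - p
    mean_fitness = p * p * (1.0 - s) + 2.0 * p * q + q * q * (1.0 - t)
    return (p * p * (1.0 - s) + p * q) / mean_fitness


def equilibrium(s, t):
    """Stable equilibrium frequency of A1 under overdominance."""
    return t / (s + t)


def simulate(p0, s, t, generations):
    """Iterate the selection recursion from a starting frequency p0."""
    p = p0
    for _ in range(generations):
        p = next_generation(p, s, t)
    return p


if __name__ == "__main__":
    s, t = 0.1, 0.2  # hypothetical selection coefficients
    for start in (0.01, 0.5, 0.99):
        print(f"start {start:.2f} -> {simulate(start, s, t, 2000):.4f} "
              f"(predicted {equilibrium(s, t):.4f})")
```

Both alleles are retained indefinitely in this model, in contrast to a neutral
polymorphism which would eventually be lost by drift.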

Certain MHC alleles are very ancient. Indeed, it would appear that some of the 
class I MHC allelic lineages are shared by human, chimpanzee and gorilla, imply- 
ing that these polymorphisms were present in the common ancestor of the three 
species (Fan et al., 1989; Gyllensten et al., 1991; Kupfermann et al., 1992; Lawlor 
et al., 1988; 1991; Mayer et al., 1988; 1992). Similar findings have been reported in 
the Cercopithecinae (Castro et al., 1996). The long-term persistence of such trans- 
species polymorphism is potentially explicable by overdominant selection because 
neutral alleles are unlikely to survive the process of speciation (Klein et al., 1993). 
Selection would be expected to act on MHC alleles that differed in their ability to 
bind and present foreign peptides. In a population exposed to pathogens, an indi- 
vidual heterozygous at many MHC loci would be able to present a larger number 
of foreign peptides than a homozygote and might therefore be resistant to a wider 
range of pathogens (Parham and Ohta, 1996). Support for the overdominant selec- 
tion hypothesis has come from studies (Hughes and Nei, 1988, 1989a; Ohta, 1991) 
that noted a significantly higher rate of nonsynonymous than synonymous 
nucleotide substitution in the peptide-binding region (antigen recognition site), 
strong evidence for the action of positive selection (Ayala et al., 1994). It should 
however be noted that negative or purifying selection may also operate on the 
HLA system so as to reduce diversity. One example of this is found in the peptide- 
binding region of the HLA-E gene in New World monkeys (Knapp et al., 1998). 
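
The comparison of nonsynonymous and synonymous substitution rates referred to above
can be sketched as follows. The proportions of nonsynonymous and synonymous
differences per site are corrected for multiple hits with the Jukes-Cantor formula
and their ratio is then taken. A ratio above 1 is the signature of positive selection,
whereas a ratio below 1 points to purifying selection. The counts used in this Python
example are hypothetical, the function names are invented here, and the counting of
synonymous and nonsynonymous sites (which is the laborious step in practice) is
assumed to have been done already.

```python
import math


def jukes_cantor(p):
    """Jukes-Cantor distance from the observed proportion p of differing sites."""
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)


def dn_ds(nonsyn_diffs, nonsyn_sites, syn_diffs, syn_sites):
    """Return (dN, dS, dN/dS) from counts of differences and sites."""
    d_n = jukes_cantor(nonsyn_diffs / nonsyn_sites)
    d_s = jukes_cantor(syn_diffs / syn_sites)
    return d_n, d_s, d_n / d_s


if __name__ == "__main__":
    # Hypothetical counts for a peptide-binding region (illustration only).
    d_n, d_s, ratio = dn_ds(nonsyn_diffs=20, nonsyn_sites=130,
                            syn_diffs=3, syn_sites=40)
    verdict = "positive selection" if ratio > 1 else "purifying selection or neutrality"
    print(f"dN = {d_n:.3f}, dS = {d_s:.3f}, dN/dS = {ratio:.2f} ({verdict})")
```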

The high frequency of polymorphism has probably also been maintained by 
other means such as temporal variation in selection pressure driven by changes in 
pathogens. Rare allele advantage (frequency-dependent selection) may play a role 
in that individuals with a rare MHC allele may respond better to challenge from 
new pathogen variants that have evolved in such a way as to evade the products of 
the more common MHC alleles. Population size and structure may also be impor- 
tant since diversity may be promoted by high effective population size or as a 
result of the agglomeration of smaller populations each bearing a few different 
alleles (Wells and Parham, 1996). 

A high level of polymorphism can also be maintained by a ‘genetic hitch- 
hiker’ effect. Thus the occurrence of at least 11 olfactory receptor genes 
(OR2C1) within the MHC complex (Fan et al., 1995) may potentiate the selection 
of specific receptor alleles matched to a subset of MHC-determined odorants, 
such that those haplotypes that carry certain matched alleles would have a 
selective advantage in the population (Gruen and Weissman, 1997). If odorant 
discrimination is learned, this could have implications for the population 
genetics of species in which odorant-mediated kin recognition is important. 
Intriguingly, Wedekind et al. (1995) have claimed there to be evidence in 
humans for a preference for particular MHC-linked variations in sweat odors. 

The common marmoset (Callithrix jacchus), a New World primate, possesses 
limited MHC class II variability as a result of the inactivation of the MHC-DP 
region and limited polymorphism at the MHC-DR and -DQ loci (Antunes et al., 
1998). This limited MHC class II repertoire could play a role in the apparently 
increased susceptibility of this species to viral, bacterial, protozoan and helminth 
infections. 


Mucin genes. Many epithelial tissues such as trachea, mammary gland, pancreas, 
stomach, cervix and intestine produce high molecular weight glycoproteins 
known as mucins which are the major proteins of mucus. The mucins display 
only limited homologies with one another (Desseyn et al., 1997a) but do share the 
property of containing extensive tandemly repetitive regions. These regions can 
vary in length from 8 amino acid residues in MUC5AC to 169 residues in MUC6 
and can be highly polymorphic. Several of the human mucin genes are located 
within a 400 kb cluster on chromosome 11p15 (MUC2, MUC5AC, MUC5B, 
MUC6), consistent with a series of successive gene duplications (Gum, 1992), 
whereas others are solitary (MUC1, 1q21; MUC3, 7q22; MUC4, 3q29; MUC7, 
4q13-q21; MUC8, 12q24) (Pigny et al., 1996). Although the human mucins dis- 
play only a limited degree of homology with one another, some mucin genes pos- 
sess exons of similar length and distribution consistent with their having evolved 
from a common ancestor (Buisine et al., 1998; Desseyn et al., 1997a, 1998). 

As noted above, the mucin genes often contain internal tandemly repetitive 
domains. Thus the MUC2 gene contains two regions with a high degree of inter- 
nal homology but no homology with each other. The first region comprises mul- 
tiple 48 bp repeats interrupted by 21-24 bp segments whilst the second region is 
composed of 69 bp repeats arranged in a tandem array of up to 115 copies 
(Toribara et al., 1991). The MUC5B gene contains an extremely large (10.7 kb) 
exon which encodes a 3570 amino acid protein that contains 19 subdomains 
which can be grouped into four larger composite units of 528 amino acids (‘super- 
repeats’) (Desseyn et al., 1997b). Similarly, the MUC4 gene contains an uninter- 
rupted 18 kb exon encoding about 380 units of length 48 bp (Nollet et al., 1998). 
Presumably these enormous exons have gradually become extended by serial 
internal duplication events that have not altered the splicing pattern, merely the 
length of the exon to be spliced. A model for the evolution of these complex and 
highly variable genes is presented in Figure 4.22. 
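
The repeat lengths quoted above in base pairs and in amino acids can be cross-checked with some simple arithmetic. The following sketch assumes only that the repeats are in-frame (three base pairs per codon) and that the quoted exon sizes are rounded figures; the function names are invented for illustration.

```python
def repeat_length_aa(repeat_bp):
    """Number of amino acids encoded by an in-frame repeat unit."""
    if repeat_bp % 3 != 0:
        raise ValueError("an in-frame repeat must be a multiple of 3 bp")
    return repeat_bp // 3


def max_repeat_units(exon_bp, repeat_bp):
    """Upper bound on the number of whole repeat units fitting in an exon."""
    return exon_bp // repeat_bp


# MUC2: the 48 bp and 69 bp repeats correspond to 16 and 23 residue units.
print(repeat_length_aa(48))          # 16
print(repeat_length_aa(69))          # 23

# MUC4: an 18 kb exon could hold at most 375 units of 48 bp,
# in line with the figure of about 380 given the rounded exon size.
print(max_repeat_units(18_000, 48))  # 375

# MUC5B: a 10.7 kb exon spans 3566 whole codons, close to the
# 3570 amino acids quoted.
print(10_700 // 3)                   # 3566
```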


Figure 4.22. Hypothetical scheme for the evolution of the MUC6, MUC2, MUC5AC, and 
MUC5B mucin genes from a common ancestor (redrawn from Desseyn et al., 1998). Unique 
domains flanking the tandem repeat arrays, conserved in each protein, are represented by 
rectangular or square boxes. A triangle represents the Cys sub-domain. The repetitive 
central domains are represented by the following symbols: circles (closed, empty and 
hatched) for the 23 amino acid tandem repeats of MUC2, the 8 amino acid tandem repeats 
of MUC5AC, and the 29 amino acid tandem repeats of MUC5B. The oval represents the 
irregular 16 amino acid tandem repeat polypeptide of MUC2. Diamonds represent the 
MUC5B R-end sub-domain type. t, r, and s denote unknown numbers of repeats. 


Genes encoding RNA-binding proteins. RNA-binding proteins are involved in 
a wide range of biological functions including mRNA splicing, processing and 
translation. These proteins constitute a family insofar as they contain 
RNA-binding domains (Figure 4.23) which share a common evolutionary origin that 
predates the divergence of prokaryotes and eukaryotes (Fukami-Kobayashi et al., 
1993). Examples of human genes encoding RNA-binding proteins include the 70 
kDa nuclear ribonucleoprotein (SNRP70; 19q13.3) gene, the heterogeneous 
nuclear ribonucleoprotein A1 (HNRPA1; 12q13.1) gene, the Arg/Ser-rich 
splicing factor (SFRS2; chromosome 17) gene, the poly(A) binding protein 
(PABPL1; 3q22-q25, 12q13-q14, 13q12-q13) genes, and the NCL (2q12-qter) gene 
which encodes nucleolin, a protein involved in the control of transcription of 
rRNA genes and in the nucleocytoplasmic transport of ribosomal components. 


Sulfatase genes. Sulfatases catalyze the hydrolysis of sulfate ester bonds in a 
variety of substrates. Sulfatase genes have been described in lower eukaryotes and 
comprise an evolutionarily conserved multigene family in human (Parenti et al., 
1997): arylsulfatase A (ARSA; 22q13), arylsulfatase B (ARSB; 5q11-q14), arylsul- 
fatase D (ARSD; Xp22.3), arylsulfatase E (ARSE; Xp22.3), arylsulfatase F 
(ARSF; Xp22.3) proximal to the pseudoautosomal boundary, steroid sulfatase 
(STS; Xp22.3) 4 Mb distant from the ARSD/ARSE/ARSF cluster, iduronate- 
2-sulfatase (IDS; Xq27-28), galactose 6-sulfatase (GALNS; 16q24) and 
glucosamine 6-sulfatase (GNS; 12q14). The chromosomally dispersed sulfatase 
genes exhibit a relatively low degree of sequence identity consistent with these 
genes having emerged comparatively early during evolution. By contrast, the 
four genes located at Xp22.3, close to the pseudoautosomal region, are more 
similar in sequence, share a very similar exon-intron organization and possess 
homologues on the Y chromosome (Meroni et al., 1996), consistent with duplication 
events that occurred before the X and Y copies of these genes started to diverge 
(i.e. while they were still pseudoautosomal). 

Figure 4.23. Evolutionary schema for various types of RNA-binding protein (redrawn 
from Fukami-Kobayashi et al., 1993). PABP: poly(A) binding protein. hnRNP: 
heterogeneous nuclear ribonucleoprotein. snRNP: small nuclear ribonucleoprotein. 


Genes encoding tRNAs and aminoacyl-tRNA synthetases. tRNA mole- 
cules perform a central role in protein biosynthesis by potentiating the transfer 
of amino acids to the ribosome. Each amino acid is bound to the tRNA by an 
aminoacyl-tRNA synthetase. Each tRNA contains sequence elements (located 
in the anticodon, acceptor stem or the discriminator base at position 73) that 
are unambiguously recognized by its cognate synthetase. The tRNA-tRNA 
synthetase complex, which also includes an elongation factor and GTP, enters 
the ribosome where the anticodon of the tRNA interacts specifically with one 
of the codon triplets of the mRNA. Subsequently, the amino acid is covalently 
bound to the C-terminal of the nascent polypeptide chain. tRNAs have similar 
sizes and tertiary structures (Mans et al., 1991) but are subdivided into 
isoaccepting groups which, despite having different sequences, recognize the same 
amino acid. 

It is unclear how many functional tRNA genes are present in the human 
genome. An early saturation hybridization study put the total number of tRNA 
genes at about 1300 per haploid human genome, an average of 65 copies for each 
tRNA species (Hatlen and Attardi, 1971). Chromosomally allocated human 
tRNA genes include TRAN (alanine, 6p21-p22), TRE (glutamic acid, 1p36), 
TRG1 (glycine, chromosome 1), TRK1 (lysine, 17p13), TRL1 (leucine 1, 
14q11-q12), TRL2 (leucine 2, 17p13), TRMI1 and TRMI2 (methionine, 6p23-q12), 
TRN (asparagine, 1p36), TRP1 and TRP2 (proline 1 and 2, 14q11-q12), TRP3 
(proline 3, chromosome 5), TRQ1 (glutamine 1, 17p13), TRR1 (arginine 1, 
17p13), TRR3 (arginine 3, 6p21-p22), TRR4 (arginine 4, 6p21-p22), TRT1 
(threonine 1, chromosome 5), and TRT2 (threonine 2, 14q11-q12). Thus tRNAs from 
different isoaccepting groups often cluster together, for example 6p22 (TRAN, 
TRMI1, TRMI2, TRR3, TRR4; Buckland et al., 1996), 14q11-q12 (TRL1, TRP1, 
TRP2, TRT2; Chang et al., 1986) and 17p13 (TRK1, TRL2, TRQ1, TRR1; 
Morrison et al., 1991). Other tRNA genes are dispersed and, since these genes are 
often flanked by short 8-12 bp direct terminal repeats, they may have arisen by 
RNA-mediated transposition (McBride et al., 1989). 

A human gene (TRSP) encoding an opal suppressor phosphoserine tRNA has 
been characterized on chromosome 19q13 (O’Neill et al., 1985). This tRNA is 
potentially capable of suppressing nonsense mutations since it is able to recognize 
a different codon than that corresponding to the amino acid it carries. This abil- 
ity is conferred upon it by dint of a base change in its anticodon which allows it to 
translate a UGA Stop codon and insert a serine at this position. 

Phylogenetic studies of tRNA genes have shown that members of isoaccepting 
groups do not invariably cluster together (Saks and Sampson, 1995). It would 
appear that during evolution, tRNAs have been recruited from one isoaccepting 
group to another by mutations in the anticodon (Saks and Sampson, 1995). 

The aminoacyl-tRNA synthetases constitute a large gene family in the human 
genome including glutamyl/prolyl (EPRS; 1q32-q42), lysyl (KARS; 16q23-q24), 
alanyl (AARS; 16q22), arginyl (RARS; 5pter-q11), valyl (VARS1; chromosome 9), 
histidyl (HARS; chromosome 5), asparaginyl (NARS; chromosome 18), threonyl 
(TARS; 5p13-cen), methionyl (MARS; chromosome 12), isoleucyl (IARS; 9q21), 
glycyl (GARS; 7p15), tryptophanyl (WARS; 14q32), cysteinyl (CARS; 11p15.5) 
and leucyl (LARS; 5cen-q11). Whilst many of the members of this gene family are 
chromosomally widely dispersed, similar chromosomal locations for some mem- 
bers are suggestive of linkage. The aminoacyl-tRNA synthetases may be grouped 
into two distinct classes, each with ten members, based upon sequence data and 
structure (Cusack et al., 1991; Eriani et al., 1990). These groups are thought to 
have arisen from two progenitor aminoacyl-tRNA synthetases by gene duplica- 
tion and divergence (Nagel and Doolittle, 1995). Despite their ancient origin and 
their presence in both prokaryotes and eukaryotes, sequence similarity between 
aminoacyl-tRNA synthetases from eukaryotes and prokaryotes may be as low as 
15% (Nagel and Doolittle, 1995). 

Interestingly, there is a relationship between synthetase class and the 
nucleotide that is conserved at position 73. Eight of the ten tRNAs that are 
aminoacylated by class I synthetases have an adenine residue at this position 
whereas those that are aminoacylated by class II synthetases manifest greater 
nucleotide diversity (Saks and Sampson, 1995). It is clear that the genes for both 
tRNAs and aminoacyl-tRNA synthetases have a very long evolutionary history. 
Their origins and the means by which they may have coevolved are as yet unclear 
(Saks and Sampson, 1995) although recent work has suggested that the tRNA syn- 
thetases may have been preceded by their tRNAs (Ribas de Pouplana et al., 1998). 


Ubiquitin genes. The ubiquitins are extremely highly conserved 76 amino 
acid proteins which are required for the ATP-dependent nonlysosomal intracellular 
degradation of defective proteins and proteins with a rapid turnover 
(Schlesinger and Bond, 1987). Although found in all eukaryotes, they have not so 
far been found in prokaryotes. 

The human genome contains multiple ubiquitin-related sequences most of 
which are processed pseudogenes (Schlesinger and Bond, 1987). There are how- 
ever at least four functional ubiquitin genes which together have provided an 
interesting challenge to carefully crafted general definitions of the gene (Chapter 1, 
section 1.2.1). These genes belong to one of two distinct classes. The first is 
exemplified by the ubiquitin C (UBC; 12q24) gene which contains 9 direct repeats of 
the ubiquitin amino acid sequence with no spacer regions and no introns (Wiborg 
et al., 1985). The UBC locus has a single termination codon and yields a 2500 
nucleotide mRNA which upon translation generates a polyubiquitin precursor 
molecule which is post-translationally cleaved to active ubiquitin. Unequal cross- 
ing over events at the UBC locus have been responsible for polymorphism in ubiq- 
uitin repeat unit number (Baker and Board, 1989). A second human polyubiquitin 
gene (UBB, 17p11-p12) with 3 repeat units yields a 1000 nucleotide mRNA (Webb 
et al., 1990). The second class comprises ubiquitin fusion genes which encode a sin- 
gle ubiquitin fused in-frame to a ribosomal protein of either 52 (UBA52; 19p13; 
Baker and Board, 1991; 1992) or 76-80 (UBA80) amino acids (Lund et al., 1985). 

The number of nucleotide differences between ubiquitin repeats within a given 
species is very low as compared to those between species (Sharp and Li, 1987). 
Thus, repeats within a locus share a more recent common ancestor than any two 
repeats in different species. This finding is explicable in terms of concerted evo- 
lution, probably mediated by unequal crossing over or gene conversion (Sharp 
and Li, 1987; Vrana and Wheeler, 1996). In human, concerted evolution is evi- 
dent both within and between ubiquitin loci but appears to occur at a higher rate 
for some repeats than others (Tan et al., 1993). 
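
The signature of concerted evolution described above, namely that repeats differ less within a species than between species, can be made concrete with a small sketch. The sequences below are invented toy data (not real ubiquitin repeats) and the function names are hypothetical; the sketch assumes pre-aligned repeats of equal length and uses the raw number of mismatches as the measure of difference.

```python
from itertools import combinations, product


def hamming(a, b):
    """Number of mismatched positions between two aligned sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to equal length")
    return sum(x != y for x, y in zip(a, b))


def mean_within(repeats):
    """Mean pairwise difference among repeats from one species."""
    pairs = list(combinations(repeats, 2))
    return sum(hamming(a, b) for a, b in pairs) / len(pairs)


def mean_between(repeats_1, repeats_2):
    """Mean difference over all cross-species pairs of repeats."""
    pairs = list(product(repeats_1, repeats_2))
    return sum(hamming(a, b) for a, b in pairs) / len(pairs)


# Toy repeat arrays for two hypothetical species, X and Y.
species_x = ["ATGCAGATCTTC", "ATGCAGATCTTC", "ATGCAGATTTTC"]
species_y = ["ATGCAAATATTT", "ATGCAAATATTT", "ATGCAAATCTTT"]

print(round(mean_within(species_x), 2))              # 0.67
print(round(mean_within(species_y), 2))              # 0.67
print(round(mean_between(species_x, species_y), 2))  # 2.78
```

If the repeats had diverged independently since their duplication, the within-species and between-species values would be expected to be comparable; a much lower within-species value is the pattern attributed to homogenization by unequal crossing over or gene conversion.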


An overview of the evolution of multigene families in eukaryotes. During 
eukaryotic evolution, gene families have been created by sequential, phased 
rounds of gene duplication with specific genes becoming duplicated as a result of 
whole genome duplications, subchromosomal regional duplications, and through 
the rather more discrete duplication of individual gene loci. Since the number of 
genes in a gene family varies quite widely, we may surmise that some sequences 
are more predisposed to duplicate than others. As a consequence of their common 
origin, multigene family members usually have a similar structure both in terms 
of their nucleotide sequences and exon-intron organization (although the gain or 
loss of some introns in some members can serve to obscure their common ances- 
try). Whereas some genes have retained their syntenic relationship with each 
other after duplication and have remained chromosomally linked for relatively 
long periods of evolutionary time, others have been translocated to another chro- 
mosomal location. In some gene families, new rounds of gene duplication have 
then led to the formation of gene clusters. Once diversification of gene clusters, 
sub-clusters and individual genes had occurred through the acquisition of com- 
paratively subtle mutational changes, the duplication of entire gene clusters as 
well as portions of clusters would then have led to the emergence of a sub-family 
structure within the gene family. 

Diversification of individual genes within multigene families has occurred in a 
host of different ways in different lineages. Amino acid substitutions may appear 
to constitute relatively subtle structural changes but these changes can be quite 
dramatic in functional terms if, for example, substrate specificity is altered. 
Diversification can also proceed by internal duplication and deletion (e.g. gain or 
loss of individual exons), the insertion or removal of individual amino acid 
residues, repeat expansion and by the more complex processes of gene conversion, 
fusion and recombination. Family members have sometimes been inactivated or 
lost so that in some cases, the immediate ancestors of extant genes may now no 
longer exist (molecular ‘missing links’). In others, promoter changes have led to 
the emergence of differences in gene expression either in terms of tissue specificity 
or level of expression or in responsiveness to inductive stimuli whether of envi- 
ronmental (e.g. temperature), systemic (e.g. hormonal) or intra-cellular origin (e.g. 
transcription factors). Some multigene family members have evolved more quickly 
than others whether as a result of selection pressure or the stochastic processes of 
neutralist evolution. What is common to all the gene families cited above is the 
principle that gene duplication has created redundancy which has then allowed 
evolutionary experimentation through diversification, ultimately leading to the 
recruitment of a new generation of genes encoding proteins with novel properties. 


4.2.2 Highly repetitive multigene families 


Histone genes. Histones are basic nuclear proteins which make up the nucleo- 
some within the chromatin fibre. Pairs of H2A, H2B, H3, and H4 form the 
octamer with H1 being responsible for linking the nucleosomes and potentiating 
the formation of higher order chromosome structure. In mammals, the histones 
may be sub-divided into three types: (i) main-type replication-dependent his- 
tones, (ii) replication-independent ‘replacement’ histones, and (iii) tissue-specific 
histones. The genes encoding the replication-dependent and tissue-specific his- 
tones lack introns, give rise to non-polyadenylated mRNAs, contain 3’ elements 
with dyadic symmetry essential for mRNA processing, and are chromosomally 
clustered. By contrast, the genes encoding the replacement histones can contain 
introns, give rise to polyadenylated mRNAs and are solitary rather than clustered 
(Brush et al., 1985; Doenecke et al., 1994). 

Three main clusters of histone genes are apparent in the human genome and 
contain between them about 60 genes. Two clusters, located at 6p21.3, are 
separated by ~2 Mb and contain all replication-dependent H1 histone (H1F1, H1F2, 
H1F3, H1F4, H1F5; Figure 4.24) genes and surrounding core histone (H2A, 
H2B, H3, H4) genes (Albig et al., 1993; Albig and Doenecke, 1997; Albig et al., 
1997a, 1997b). The other cluster at 1q21 is smaller consisting of at least four core 
histone genes. Various solitary replacement histone genes have been located on 
different chromosomes, for example H1° (H1F0; 22q13; Albig et al., 1993), 
H2A.X (H2AX; 11q23; Ivanova et al., 1994), H2A.Z (H2AZ; 4q24; Popescu et al., 
1994) and H3.3B (H3F3B; 17q25; Albig et al., 1995). Finally, testis-specific 
histone genes H3F3A (Albig et al., 1996) and H1FT (Albig et al., 1997) have been 
localized to chromosomes 1q42 and 6p21, respectively. 

Histones are very ancient proteins as evidenced by homologies between 
prokaryotic and eukaryotic histones H2A/B, H3, and H4 (Slesarev et al., 1998; 
Ouzounis and Kyrpides, 1996). Interestingly, homology also exists between the 
core histones and the CCAAT-binding factor (Ouzounis and Kyrpides, 1996) sug- 
gesting that transcriptional regulation and nucleosomal packing may have been 
intimately related for a very considerable period of evolutionary time. 

Figure 4.24. Organization of the human histone gene cluster on chromosome 6 (after Albig 
et al., 1997a). Solid rectangles: histone genes. Open rectangles: pseudogenes. Orientations of 
histone genes are denoted by arrowheads. Restriction sites are shown as short vertical bars 
above (EcoRI) or below (MluI) the line. 

In lower eukaryotes, histone genes usually occur in long tandemly repetitive 
arrays but in mammals, although the genes are clustered, they are less ordered and 
do not occur as tandem repeats (Heintz et al., 1981). A map of part of the major 
histone cluster on human chromosome 6 is shown in Figure 4.24 and it is evident 
that the arrangement of the histone genes is fairly random; the only regularity is 
the arrangement of the divergently orientated H2A and H2B gene pairs. For both 
main-type and replacement histone types, chromosomal localization, gene num- 
ber and gene organization appear to be fairly well conserved between human and 
mouse (Albig and Doenecke, 1997; Albig et al., 1997). This indicates that the sep- 
aration of these two groups of histone genes must have occurred at an early stage 
in mammalian evolution. Chromosomal clustering may be a prerequisite for coor- 
dinate regulation. An unexpectedly high degree of within-species homogeneity of 
tandemly repeated histone genes is explicable in terms of unequal crossing over 
and/or gene conversion (Maxson et al., 1983). 

The residual sequence similarity (pairwise 15-20%) exhibited by the four core 
histones together with their structural similarity argues strongly for descent from 
a common ancestor, albeit a very ancient one (Ramakrishnan, 1995). A phyloge- 
netic analysis of the core histones has suggested that the histones that form 
dimers (viz. H2A/H2B and H3/H4) have very similar trees and appear to have co- 
evolved (Thatcher and Gorovsky, 1994). H3 and H4 have evolved ~10-fold more 
slowly than H2A and H2B and are very highly conserved (Thatcher and 
Gorovsky, 1994). The H2A.Z variant arose early and appears to be more highly 
conserved than the main-type H2A histone whilst H3.3 variants have arisen 
independently on many occasions (Thatcher and Gorovsky, 1994). By contrast, 
H2A.X arose comparatively recently during the evolution of the vertebrates 
(Thatcher and Gorovsky, 1994). 


Ribosomal RNA genes. The ribosomal RNA (rRNA) genes occur as 300-400 
copies in the human genome. They are organized in tandemly repeated blocks 
which are located on the acrocentric chromosomes: RNR1 (13p12), RNR2 
(14p12), RNR3 (15p12), RNR4 (21p12), and RNR5 (22p12). The 44 kb rDNA 
repeat unit contains a 13.3 kb RNA polymerase I-transcribed portion and a 31 kb 
nontranscribed spacer region (Figure 4.25; Gonzalez et al., 1992). The 13 kb (45S) 
transcript is then processed to generate the 18S, 5.8S and 28S mature rRNA 
molecules with respective lengths of approximately 1.9 kb, 0.16 kb, and 5.0 kb. 

The 28S rRNA contains both conserved and variable regions. The former are 
constant in size and sequence whereas the latter are variable (Gorski et al., 1987) 
and exhibit species-specific differences between primates (Gonzalez et al., 1988; 
1990). The variable regions potentiate hairpin loop formation and are essential for 
rRNA secondary structure (Gorski et al., 1987). 

The number of rRNA genes on a given chromosome varies in the population, 
while in any one individual, rRNA gene number varies between clusters 
(Arnheim et al., 1980). The rRNA clusters appear to have evolved in concerted 
fashion. Thus rRNA genes on non-homologous chromosomes are far more simi- 
lar to one another than would be expected if these genes had evolved indepen- 
dently. Several mechanisms have been proposed to account for this relative 
homogeneity of rDNA including unequal crossing over between rDNA sequences 
on nonhomologous chromosomes (Worton et al., 1988). Gene conversion may 
then act so as to homogenize rDNA repeats within each cluster. On the basis of 
linkage disequilibrium data, Seperack et al. (1988) have suggested that gene con- 
version and unequal crossing over may also occur within a chromosome (i.e. by 
sister chromatid exchanges). 


5S ribosomal RNA genes. The human 5S rRNA (RN5S1) genes occur in clus- 
ters of 2.3 kb and 1.6 kb tandem repeat units which differ from each other by 
virtue of a deletion in the 3’ flanking region (Sorensen and Frederiksen, 1991). 
There appear to be 300-400 copies of 5S rRNA genes per haploid human genome. 
Between 100 and 150 genes are derived from the 2.3 kb cluster (probably located 
at chromosome 1q42), 5-10 from the 1.6 kb cluster whereas the remaining 
200-300 genes/gene variants are not found in repeat structures and may be dis- 
persed around the genome. In addition to these genes, a large number of 5S rRNA 
pseudogenes exist which brings the total number of 5S rRNA homologous 
sequences per haploid genome to around 2000. 


Small nuclear RNA genes. The human U2 snRNA (RNU2) genes are clustered 
in tandem arrays of between 6 and 30 copies at 17q21-q22 (Van Arsdell and 
Weiner, 1984; Westin et al., 1984). The primate U2 snRNA arrays have evolved in 
concerted fashion with each repeat being essentially homogeneous within a 
species although somewhat different between species, an observation consistent 
with the action of gene conversion (Liao and Weiner, 1995; Liao et al., 1997; 
Matera et al., 1990). By contrast to the situation in higher primates, the U2 snRNA 
genes of both the mouse and the prosimian, Galago crassicaudatus, are dispersed 
rather than clustered suggesting that the arrays characteristic of higher primates 
may have resulted from amplification of a common ancestral gene (Matera et al., 
1990). Once established in the simian lineage, however, the U2 tandem repeat 
array has remained at the same chromosomal locus through multiple speciation 
events over a period of >35 Myrs (Pavelitz et al., 1995). 

Some 30 copies of the human U1 snRNA (RNU1) genes are also clustered at 
1p36.3. This site is distinct from the cluster of U1 snRNA pseudogenes at 1q12-q22 
that outnumber their functional counterparts by some 15-30 fold (Lindgren et al., 
1985). This contrasts with the situation found for the U2 snRNA genes, where the 
active genes outnumber the pseudogenes. 


4.2.3 Gene superfamilies 


The term superfamily is used to describe a group of gene families whose individual 
members share a common evolutionary origin, possess common features within a 
family but differ with respect to certain other features between families. A 
selection of some of the most important human gene superfamilies will be presented. 


Cadherin genes. The cadherins are membrane-associated glycoproteins which 
act as calcium-dependent cell adhesion molecules but which may also be involved 
in signal transduction. Cadherins may be classified into four groups (classical, 
desmosomal, protocadherins, cadherin-related proteins) which are structurally 
similar, each containing an extracellular domain consisting of between 4 and 30+ 
repeats of a 110 amino acid cadherin-specific motif. Although classical 
cadherins appear to be confined to the vertebrates, the superfamily has ancient 
origins with members present in organisms as diverse as nematodes and humans. 
The classical and desmosomal cadherins are thought to have evolved more 
recently by a process of duplication and divergence from the primordial proto- 
cadherin-like proteins (Suzuki, 1996). 

A cluster of human classical cadherin genes is present at chromosome 16q22.1 
(CDH1, CDH5, CDH3) with CDH13 and CDH15 located at 16q24 and CDH11 
not yet regionally localized on chromosome 16. Other genes encoding classical 
cadherins are present on 5p13-p14 (CDH12) and 18q11 (CDH2) whilst a proto- 
cadherin (PCDH7) gene has been mapped to 4p15. 


Cytochrome P450 genes. The cytochrome P450 enzymes comprise a large liver- 
expressed family of heme-containing electron transport molecules which are 
involved in the oxidative metabolism of a wide range of substrates including 
steroids, drugs and xenobiotics. This family can be subdivided into sub-families 
on the basis of structural and functional criteria and many are extremely poly- 
morphic, for example CYP2D6 (Marez et al., 1997). With nearly 40 CYP P450 
genes identified (and perhaps between 20 and 150 yet to be identified), this super- 
family is one of the larger superfamilies represented in the human genome: 
CYP1A1 and CYP1A2 (15q22), CYP1B1 (2p21), CYP2A6, CYP2A7, CYP2A13 
(19q13.1), CYP2B6 and CYP2B7 (19q13.2), CYP2C8, CYP2C9, CYP2C10, 
CYP2C18, CYP2C19 (10q24), CYP2D6 (22q13.1), CYP2E (10q24-qter), 
CYP2F1 (19q13.2), CYP2J2 (1p31), CYP3A4 (7q22.1), CYP4A11 (chromosome 1), CYP4B1 
(1p12-p34), CYP7A1 (8q11-q12), CYP11A (15q23-q24), CYP11B1 and CYP11B2 
(8q21), CYP17 (10q24.3), CYP19 (15q21), CYP21 (6p21.3), CYP24 (20q13), 
CYP26A1 (10q23-q24), CYP27A1 (2q33-qter), CYP27B1 (12q13.3-q14) and 
CYP51 (7q21). Several gene clusters are apparent, for example on 10q24 and 
19q13, and this is likely to reflect a history of gene duplication and divergence 
(Hoffman et al., 1995; Nelson et al., 1996). A common origin may however also be 
reflected by the possession of a similar exon/intron arrangement as in the case of 
the CYP17 and CYP21 genes or the CYP11, CYP24, and CYP27 genes. The 
phylogeny of the cytochrome P450 superfamily has been explored to some extent and 
various attempts have been made to date the various gene duplication events 
which have given rise to extant genes (Degtyarenko and Archakov, 1993; Nelson 
and Strobel, 1987; Nebert et al., 1989). Much of the increase in CYP P450 gene 
number took place around 400 Myrs ago. This was the time when tetrapods first 
began to colonize the land and feed upon the plants that had become established 
in the late Silurian and early Devonian periods. Since these terrestrial plants 
probably contained toxic compounds, expansion of the CYP P450 gene family to 
provide a large set of detoxifying enzymes was probably an adaptive response to 
this chemical challenge. 

The CYP51 gene encodes sterol 14-α-demethylase which plays an important 
role in sterol biosynthesis in fungi and plants as well as animals. As such, CYP51 
is the only P450 family member which is recognizable across all eukaryotic phyla 
(Rozman et al., 1996). The CYP51 gene also appears to have its counterparts in 
prokaryotes (Aoyama et al., 1998) and the gene family may have originated before 
the divergence of the eukaryotic and prokaryotic kingdoms. When members of 
the various mammalian and fungal P450 gene families were aligned and com- 
pared (Rozman et al., 1996), more than 80 intron locations were identified. Since 
it is unlikely that all of these introns were present in the primordial eukaryotic 
P450 gene, it may be concluded that P450 gene structures have evolved very con- 
siderably over the last 2 billion years either by intron insertion (Chapter 3, section 
3.5) or intron sliding (Chapter 3, section 3.4). 


Cystatin genes. The cystatin superfamily comprises a number of proteins, many 
of which are cysteine protease inhibitors, but which have evolved to take on a 
variety of different physiological functions (Rawlings and Barrett, 1990). The 
family has emerged through a process of duplication and divergence from a pri- 
mordial gene which is thought to have existed more than 1200 Myrs ago 
(Rawlings and Barrett, 1990). Human cystatin superfamily genes belong to family 
1 [cystatins A and B (CSTA, 3cen-q21; CSTB, 21q22.3)], family 2 [cystatins C, S, 
SA and SN (CST3, CST4, CST2, and CST1; 20p11.2; Thiesse et al., 1994)], family 
3 [kininogen (KNG, 3q21-qter)] or family 4 [histidine-rich glycoprotein and 
α2HS-glycoprotein (HRG, 3q27; AHSG, 3q27-q29)]. The evolutionary relationship of 
these genes remains to be unravelled (Brown and Dziegielewska, 1997; Müller- 
Esterl et al., 1985). 


G-protein-coupled receptor genes. The G-protein-coupled receptor (GPCR) 
superfamily can be separated into five different groups on the basis of their nat- 
ural ligands: (i) peptides and peptide hormones, for example endothelin receptors 
(EDNRB, 13q22; EDNRA, chromosome 4), adrenocorticotropic hormone receptor 
(MC2R; 18p11.2), angiotensin receptor (AGTR1; 3q21-q25), thyrotropin 
receptor (TSHR; 14q31), (ii) neurotransmitters, for example dopamine receptors 
(DRD1, 5q35; DRD2, 11q23; DRD3, 3q13; DRD4, 11p15; DRD5, 4p15-p16), 
(iii) other regulatory factors, for example thrombin receptor (F2R; 5q13), (iv) sen- 
sory stimuli, for example opsins (see Chapter 7, section 7.5.2, The visual pigments), 
and (v) unknown ligands. The GPCRs are evolutionarily and phylogenetically 
related, most of them possessing a common motif of 7 transmembrane domains
(Bockaert and Pin, 1999; Yokoyama and Starmer, 1996). A multitude of human
genes encoding G-protein-coupled receptors with unknown ligands are also
known and these are widely scattered around the genome, for example GPR1
(15q21.6), GPR2 (17q21), GPR3 (1p34-p36), GPR4 (19q13), GPR5 (3p21), GPR6
(6q21-q22), GPR7 (10q11-q21), GPR8 (20q13), GPR9 (8p11-p12), GPR10
(10q25-q26), GPR12 (13q12), GPR13 (3p21-pter), GPR15 (3q11-q13), GPR18
(13q32), GPR20 (8q24). 


Heat shock genes. The heat shock proteins are evolutionarily ubiquitous (‘uni- 
versal’) proteins that function as molecular chaperones under both physiological 
and stress conditions. These proteins recognize and stabilize partially folded pro- 
teins during the processes of translation, translocation across membranes and 
multimer assembly. Escherichia coli possesses one heat shock protein gene (dnaK), 
yeast possesses nine as does Drosophila. This superfamily can be subdivided into 
three major classes based upon the molecular weight and degree of homology of 
the proteins. 

The most highly conserved class is the HSP70 family represented in human by 
at least eleven genes: HSPA1A, HSPA1B and HSPA1L linked on chromosome
6p21, HSPA2 (14q24), HSPA3 (chromosome 21), HSPA4 and HSPA9 linked on
chromosome 5q31, HSPA5 (9q34), HSPA6 and HSPA7 linked on chromosome
1q, and HSPA8 (11q23-q25). Some members of the HSP70 family are ubiquitously
expressed, others are tissue-specific, some are constitutively expressed,
others are expressed only in response to stress (Günther and Walter, 1994). The
HSP70 proteins also differ in their subcellular localization, for example cytosol, 
nucleus, nucleolus, endoplasmic reticulum, mitochondrion (Günther and Walter, 
1994). The HSP70 genes are extremely ancient having arisen before the diversifi- 
cation of cellular life into bacteria, archaebacteria and eukaryotes. Indeed, 
sequence data from these proteins have been used to argue for the origin of 
eukaryotes via a fusion between archaebacteria and gram-negative bacteria (Gupta 
and Golding, 1993; Gupta and Singh, 1994). Not surprisingly in view of their uni- 
versality, they exhibit extreme evolutionary conservation (Boorstein et al., 1994). 
Chromosomal localization has also been conserved evolutionarily as evidenced by 
the linkage of three HSP70 genes to the HLA/MHC complex in both humans and 
rodents. 

The 90 kDa heat shock proteins comprise a second family of evolutionarily 
highly conserved proteins which have counterparts in bacteria as well as in all 
eukaryotes. Phylogenetic analysis has suggested that the first of the HSP90 gene 
duplications occurred very early on in the evolution of the eukaryotes (Gupta, 
1995). Human homologues of this gene family are dispersed in the genome and 
include HSPCA (1q21-q22) and HSPCB (6p12). The human genome also contains
a group of heat shock proteins of molecular weight 15-30 kDa which are
highly conserved and evolutionarily related to the α-crystallins (de Jong et al.,
1993; see Section 4.2.1). Interestingly, one of the human genes encoding heat
shock protein 27 (HSPB2) is closely linked to the αB-crystallin (CRYAB) gene
on chromosome 11q22-q23.


Insulin and insulin-like growth factor genes. The insulin gene family has
ancient origins among the primitive eukaryotes (Le Roith et al., 1980; Smit et al., 
1993). Human representatives of this superfamily include insulin (INS; 11p15.5), 
insulin receptor (INSR; 19p13), insulin-like growth factor I (IGF1; 12q22-q24),
insulin-like growth factor II (IGF2; 11p15.5) and the relaxins (RLN1 and RLN2;
9pter-q12). The origin of the IGF1 and IGF2 genes, which encode important
regulators of growth and development, antedates the emergence of the first
vertebrates, since they appeared early on in the evolution of the protochordates more
than 600 Myrs ago (Chan et al., 1990; McRory and Sherwood, 1997; Nagamatsu et 
al., 1991; Upton et al., 1997). It is likely that in primitive organisms, insulin-like 
peptides functioned so as to promote the uptake and utilization of nutrients, and 
as a consequence, stimulated growth. With increasing complexity, nutrition 
and growth became uncoupled and the insulin family diversified into proteins 
and receptors with differing capacities for regulating metabolism and growth 
(Steiner et al., 1985). 


Interferon genes. The interferon superfamily of viral defence proteins comprises 
two main classes of gene (type I and type II). The type II interferons have only one 
member, interferon-γ, encoded by a gene (IFNG) on chromosome 12q14. By contrast,
type I interferons comprise several sub-families of genes most of which are
clustered in a 400 kb region of chromosome 9p21. This cluster contains 13
interferon-α genes (IFNA1, IFNA2, IFNA4, IFNA5, IFNA6, IFNA7, IFNA8,
IFNA10, IFNA13, IFNA14, IFNA16, IFNA17 and IFNA21), a single interferon-ω
gene (IFNW1), a single interferon-β gene (IFNB1) and several pseudogenes
(Diaz et al., 1994; Figure 4.25). The genes are arranged in tandem and most of the
functional genes are oriented with their 3’ ends pointing in a telomeric direction.
There are two exceptions, IFNA1 and IFNA8, which together with four pseudogenes
orient towards the centromere (Figure 4.26). This is consistent with the
occurrence at some stage of an inverted duplication within the gene cluster with
its breakpoint between IFNP12 and IFNP11.

Figure 4.25. Organization of the human ribosomal gene clusters (after Erickson and
Schmickel, 1985). The repeat pattern consists of four EcoRI fragments: A (7.3 kb), B (6.1
kb), C (11.7 kb), and D (16-19.6 kb) which together comprise a total repeat length of
41.1-44.7 kb. The inverted triangle denotes an area of length variability. EcoRI fragments
are indicated by arrows. The 45S precursor rRNA transcript is processed to yield the
mature 18S, 5.8S, and 28S rRNAs.

Figure 4.26. Map of the human type-I interferon gene cluster on chromosome 9p22 (after
Diaz et al., 1994). Solid rectangles: genes; open rectangles: pseudogenes.


The evolutionary relationships of the various IFNA family members have been
determined by phylogenetic analysis (Golding and Glickman, 1985) and are
consistent with the emergence of this gene family by a process of gene duplication
and divergence (Miyata et al., 1985). The proximal genes, IFNA1, IFNA2, IFNA5,
IFNA6, IFNA13, appear more closely related to each other than they are to the
distal genes, IFNA4, IFNA7, IFNA10, IFNA16, IFNA17, and IFNA21, whilst IFNA8
is equally divergent from both groups (Gillespie and Carter, 1983; Henco et al.,
1985; Miyata and Hayashida, 1982). Since these two groups are located at opposite 
ends of the gene cluster, this may reflect an early multi-gene duplication. The gene 
cluster appears to have evolved by gene duplication as a result of unequal crossing 
over involving tandem units of IFNA and IFNW1 genes (Diaz et al., 1994) but
divergence between gene sequences as well as subsequent gene deletions and dupli- 
cations have served to erase the evolutionary history of parts of the gene cluster 
(Diaz, 1995). The IFNA14 gene, located midway between the two groups, is simi- 
lar to the distal group in its 5’ half and the proximal group in its 3’ half (Diaz et al., 
1994). This is explicable either in terms of an unequal crossing over event between
distal and proximal interferon genes or a gene conversion event which has cor- 
rected the 5’ half of the gene against a donor proximal gene. Several of the human 
IFNA genes differ from the other family members by one or more relatively subtle 
mutations. Golding and Glickman have shown in their insightful and prescient 
(1985) study that these changes are explicable in terms of their templation by the 
local DNA sequence environment. Thus, the in-frame deletion of a GAT codon in 
the IFNA2 gene may have been templated by a 5 bp inverted repeat 9 bp 5’ to the 
observed deletion (Figure 4.27a). Similarly, a 9 bp inverted repeat (separated by 25 
bp DNA) may have templated an AA to GT change in the 5’ flanking region of the 
interferon “α9” gene (Figure 4.27b) whilst a 16 bp direct repeat could have
templated five distinct sequence changes (two transitions, two transversions and a
guanine insertion) as the product of one mutational event (Figure 4.27c).
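
The kind of sequence feature invoked here, an inverted repeat whose two arms are reverse complements of one another, can be located computationally. The following is a minimal sketch only: the function names and the toy sequence are hypothetical and are not taken from Golding and Glickman (1985), and the arm length and spacer limit are arbitrary illustrative values.

```python
# Translation table for complementing a DNA string.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def find_inverted_repeats(seq, arm, max_spacer):
    """Find pairs of arms of length `arm` in which the right arm is the
    reverse complement of the left arm, separated by at most `max_spacer`
    bases. Returns a list of (left_start, right_start, left_arm) tuples."""
    hits = []
    for i in range(len(seq) - 2 * arm + 1):
        left = seq[i:i + arm]
        target = reverse_complement(left)
        for spacer in range(max_spacer + 1):
            j = i + arm + spacer
            if j + arm > len(seq):
                break
            if seq[j:j + arm] == target:
                hits.append((i, j, left))
    return hits


# Toy sequence (hypothetical): a 5 bp arm, a 4 bp spacer, then the
# reverse complement of the arm, flanked by two bases on either side.
toy = "TT" + "GACCA" + "AAAA" + reverse_complement("GACCA") + "TT"
hits = find_inverted_repeats(toy, arm=5, max_spacer=25)
print(hits)
```

A repeat found in this way is only a candidate template; whether it actually directed a given sequence change is an inference from the pattern of changes, as in the examples above.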

The type I interferon genes not only lack introns but also exhibit sequence 
homology indicating that they share a common ancestry (Miyata and Hayashida, 
1982). Since the spacer regions between primate IFNA genes still retain some
sequence similarity, we may surmise that some of the gene duplication events 
have been relatively recent (Ullrich et al., 1982). Indeed, Miyata and Hayashida 
(1982) estimated that they arose within the last 26 Myrs. The duplicational expan- 
sion of the IFNA gene cluster may have been driven by the need to produce large 
quantities of interferon rapidly, by selection for a novel temporal or spatial pattern 
of expression, or by selection for a specialized novel function, perhaps associated 
with a variant receptor with lower affinity for its ligand (Diaz, 1995). 
Figure 4.27. Evolution of the human α-interferon gene cluster; the role of
template-mediated mutational changes (redrawn from Golding and Glickman, 1985). (a) An
inverted repeat may have formed a template for the deletion of a GAT codon in the
IFNA2 gene. (b) An inverted repeat may have formed a template for an AA to GT change
within the 5’ flanking region of the interferon “α9” gene. (c) A direct repeat may have
directed five changes in the 5’ flanking region of the IFNA10 gene. The 16 bp repeat is
marked by asterisks. The putative ancestral sequence is shown above, the descendent
sequence below.

The absence of introns in the IFNA and IFNB genes probably reflects the
organization of the ancestral gene which could conceivably have been a
retrotransposed copy of an intron-containing interferon gene such as that encoding
interferon-γ (IFNG; 12q14) which contains three introns. The IFNA and IFNB
genes also exhibit an axis of internal symmetry, presumably the result of an inter- 
nal duplication which must have occurred prior to the divergence of the two gene 
families (Erickson et al., 1984; Miyata et al., 1985). 


Nuclear receptor genes. Nuclear receptors are ligand-activated transcription 
factors that regulate the expression of their target genes by binding to specific cis- 
acting sequences in their promoter regions. This superfamily may be divided into 
a minimum of three distinct groups, one containing the receptors for steroid hor- 
mones (glucocorticoids, androgens, estrogens, progesterone, etc), a second con- 
taining receptors for vitamin D, thyroid hormone and retinoic acid and a third 
containing various ‘orphan’ receptors which putatively interact with ligands that 
still remain to be identified. Nuclear receptors exhibit a modular organization 
and contain at least four domains: an A/B domain involved in transactivation, a 
highly conserved zinc finger-containing domain (C) involved in DNA binding, a 
hinge (D) domain and a carboxy terminal (E) domain that is required for ligand 
binding, dimerization and transcriptional regulation. 

Human genes belonging to this superfamily include the androgen receptor 
(AR; Xq11-q12), estrogen receptor (ESR1; 6q25), glucocorticoid receptor (GRL;
5q31), mineralocorticoid receptor (MLR; 4q31), progesterone receptor (PGR;
11q22), retinoic acid receptors α (RARA; 17q21), β (RARB; 3p24) and γ (RARG;
12q13), thyroid hormone receptors α (THRA; 17q21) and β (THRB; 3p24) and
vitamin D receptor (VDR; 12q12-q14). Genes encoding the ‘orphan’ receptors
include hepatocyte nuclear factor 4 (HNF4A; 20q12-q13), the COUP transcription
factors (TFCOUP1, 5q14; TFCOUP2, 15q26), the retinoid X receptors
(RXRA, 9q34; RXRB, 6p21.3; RXRG, 1q22-q23) and the peroxisome
proliferator-activated receptors α (PPARA; 22q12-q13), γ (PPARG) and δ (PPARD;
6p21). It can be seen that the retinoic acid and thyroid hormone receptor genes 
are arranged in two syntenic groups. 

The ancestral nuclear receptor gene may have originated very early by fusion of 
DNA-binding and steroid-binding domains (Amero et al., 1992; Moore, 1990; 
O’Malley, 1989). Indeed, Escriva et al. (1997) have proposed that the ancestral 
nuclear receptor was an orphan receptor which subsequently acquired its ligand- 
binding potential. Since numerous vertebrate nuclear receptors have Drosophila 
homologues, the nuclear receptor superfamily must have diversified before the 
divergence of arthropods and vertebrates more than 500 Myrs ago. This diversifi- 
cation may have proceeded along the lines of the schema laid out in Figure 4.28. 
Two waves of gene duplication occurred, one very early on giving rise to the dif- 
ferent receptor groups and a second in the vertebrate lineage. This was accompa- 
nied by domain shuffling between genes (Escriva et al., 1997; Laudet, 1997; 
Laudet et al., 1992). Sequence from the thyroid hormone receptor related gene, 
THRAL, appears to have been translocated to the THRA locus thereby creating 
the final THRA exon. This must have occurred early on in mammalian evolution 
since this organization is present in rat and human but not in chicken (Laudet et
al., 1992; Figure 4.28). An E domain from subfamily I has been acquired by
the VDR gene which distinguishes it from the other genes in the steroid hormone
receptor family (Laudet et al., 1992; Figure 4.28). There appears however to be no
relationship between the phylogenetic position of a liganded receptor and the
chemical nature of its ligand (Escriva et al., 1997).

Figure 4.28. Evolutionary schema for the human nuclear receptor genes (redrawn from
Laudet et al., 1992).


Protein kinase C genes. The members of the mammalian protein kinase C 
superfamily are involved in a number of signalling systems underlying a range of 
cellular processes. The characterized members of the human superfamily can be 
divided into three subfamilies, class I: α (PRKCA; 17q22-q23), β (PRKCB1;
16p11) and γ (PRKCG; 19q13), class II: δ (PRKCD; 3p) and θ (PRKCQ; 10p15),
class III: ε (PRKCE), class IV: ι (PRKCI; Xq21.3), and ζ (PRKCZ). These genes
have an ancient origin, their homologues having been found in both nematode 
and yeast (Mellor and Parker, 1998). The deduced amino acid sequences are 
highly conserved among mammals (>95% homology between humans and 
rodents) although homology is lower (45-65%) between nematodes and mam- 
mals. The expression pattern of the different proteins can however vary markedly. 
Structurally, each subfamily is characterized by a different arrangement of regula- 
tory domains which serve to determine signalling specificity (Mellor and Parker, 
1998). 


Serine protease genes. The serine protease superfamily provides an archetypal
example of evolution by gene duplication and divergence coupled with exon
shuffling. Thus, mast cell chymase (CMA1; 14q11.2; Huang and Hellman, 1994),
the digestive pancreatic proteases [trypsin (PRSS1 and PRSS2; 7q35),
chymotrypsin (CTRB1; 16q23) and elastase (ELA1; 12q13)] and the hepatic
proteases of coagulation (e.g. thrombin; F2; 11p11-q12) have all evolved from the
same common ancestral precursor gene (Greer, 1990). This ancestral gene appears 
to predate the divergence of eukaryotes and prokaryotes (Rypniewski et al., 1994) 
and may itself have been the product of an internal duplication (McLachlan, 
1979). Enormous structural and functional diversity has been generated both by 
the addition of non-proteolytic domains and by changes in the enzyme active sites 
which account for differences in their specificity. 

The vitamin K-dependent serine proteases of coagulation exhibit substantial 
sequence and structural homology and their evolution is described in some detail 
in Section 4.3 and Chapter 3, section 3.6.3 and Chapter 10, section 10.2. Human 
representatives of this serine protease family include factor VII (F7; 13q34), fac- 
tor IX (F9; Xq27), factor X (F10; 13q34), prothrombin (F2; 11p11-q12), protein 
C (PROC; 2q13-q14) and protein S (PROS1; 3p11). Various serine proteases have 
also been recruited to functions in fibrinolysis e.g. plasminogen (PLG; 6q26-q27), 
tissue-type plasminogen activator (PLAT; 8p11-q12) and urokinase (PLAU;
10q24-qter) and, phylogenetically, these are closely related to the contact factors: 
factor XI (F11; 4q34), factor XII (F12; 5q33-qter) and plasma kallikrein (KLK3; 
4q34-q35) (Figure 3.7). 

The division between the proteases of coagulation and fibrinolysis is apparent 
at the level of exon/intron organization but also in terms of codon usage for the 
active site serine residue (Brenner, 1988). The genes encoding the vitamin K- 
dependent factors of coagulation possess an AGY codon whereas the fibrinolytic 
enzyme genes exhibit a TCN codon also found in the serine protease genes of 
eubacteria and invertebrates. These alternative codons cannot be interconverted 
by a single nucleotide substitution. Brenner (1988) proposed that the AGY codon 
could have been derived from an active cysteine residue encoded by TGY in a cys- 
teine protease that existed billions of years ago. Irwin (1988) has however argued 
that the AGY codon evolved from a TCN codon on at least two separate occasions, 
once on the lineage leading to the vitamin K-dependent factors of coagulation and 
once on the lineage leading to plasminogen and apolipoprotein(a). 
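
The codon argument can be checked by enumeration. The sketch below is illustrative only: it expands the ambiguity symbols used above (Y = C or T, N = any base) and counts the minimum number of nucleotide substitutions separating any two codon families. The function names are hypothetical.

```python
from itertools import product

BASES = "ACGT"
# Ambiguity symbols used in the text: Y = pyrimidine (C or T), N = any base.
AMBIGUITY = {"Y": "CT", "N": BASES}


def expand(pattern):
    """List every concrete codon matched by a pattern such as 'AGY'."""
    options = [AMBIGUITY.get(symbol, symbol) for symbol in pattern]
    return ["".join(codon) for codon in product(*options)]


def min_substitutions(pattern_a, pattern_b):
    """Smallest number of single-base changes converting some codon of
    pattern_a into some codon of pattern_b."""
    return min(
        sum(x != y for x, y in zip(a, b))
        for a in expand(pattern_a)
        for b in expand(pattern_b)
    )


print(min_substitutions("AGY", "TCN"))  # serine to serine: 2
print(min_substitutions("TGY", "AGY"))  # cysteine to serine (AGY): 1
print(min_substitutions("TGY", "TCN"))  # cysteine to serine (TCN): 1
```

The two serine codon families are two substitutions apart, whereas the cysteine codon TGY lies a single substitution from each. This is why a cysteine intermediate is a formally possible route between them, although the enumeration says nothing about which evolutionary path was actually taken.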

Hypervariability of the active site regions is apparent in some serine proteases 
indicating that rapid evolution of the reactive center may have driven the func- 
tional divergence of these enzymes (Creighton and Darby, 1989; Huang and 
Hellman, 1994; Lesk and Fordham, 1996; Ohta, 1994). Since a similar phenome- 
non is apparent for the serine protease inhibitors (see Section 4.2.3, Serpin genes), 
it may be that the proteolytic enzymes and their inhibitors have coevolved. 


Serpin genes. The serine protease inhibitors (Serpins) are a superfamily of pro- 
teins with over 100 members in mammals, and counterparts in invertebrates, 
plants and even some viruses (Marshall, 1993). Serpins interact with the sub- 
strate-binding sites of their cognate proteases via an exposed binding site of 
canonical conformation (Bode and Huber, 1992). The protease and its inhibitor 
rapidly form a tightly bound 1 : 1 stoichiometric complex with the reactive center 
of the serpin acting as a bait for the appropriate serine protease. Mammalian 
serpins share a common ancestor some 500 Myrs ago (Bao et al., 1987) and include
α1-proteinase inhibitor (PI; 14q32), α1-antichymotrypsin (AACT; 14q32),
heparin cofactor II (HCF2; 22q11), antithrombin (AT3; 1q23), α2-antiplasmin
(PLI; 17p13), pigment epithelium-derived factor (PEDF; 17p13), C1-inhibitor
(C1I; 11q11-q13), protein C inhibitor (PCI; 14q32) and plasminogen activator
inhibitors 1 and 2 (PAI1, 7q21-q22; PAI2, 18q21.3). Other less well characterized
mammalian serpins include cytoplasmic antiproteinase 2 (PI8; 18q21.3), maspin
(PI5; 18q21.3), nexin (PI7; 2q33-q35), kallistatin (PI4; 14), neuroserpin (PI12;
3q26). Thus, serpin gene clusters are found at 14q32, 17p13 and 18q21.3 in the 
human genome. 

Not all serpins act as protease inhibitors. Thus, angiotensinogen (AGT; 1q41) 
and corticosteroid-binding globulin (CBG; 14q32) act as blood pressure regula- 
tory hormones whilst ovalbumin functions as an avian egg storage protein. 
Although there is no ovalbumin gene in the human genome, there are several 
genes encoding ovalbumin-like serpins (ov-serpins) which are clustered either on 
18q21.3 (PI10, SCCA1, SCCA2) or 6p25 (PI2, PI6, PI9). The ov-serpins on chro- 
mosome 6 share more amino acid sequence identity with one another than they 
do with their chromosome 18 counterparts with the exception of PI8 (Bartuski et 
al., 1997). By contrast, most of the chromosome 18 ov-serpins share greater 
sequence identity with a chromosome 6 ov-serpin than with each other. To 
account for these observations, Bartuski et al. (1997) proposed that the 6p25 loci 
arose first and then gave rise to the chromosome 18 cluster. 

Dendrograms derived from comparisons of serpin nucleotide and amino acid 
sequences and of intron positions differ significantly (Wright, 1993). The 
exon/intron organization of extant serpin genes is therefore unlikely to be explic- 
able simply in terms of the loss of introns from a large primordial gene. Wright 
(1993) proposed that the serpin gene family arose from an early recombination 
event which fused the amino and carboxyl domains. The subsequent insertion of 
sequence (possibly intronic) served to create β-sheet A and stabilized the new
structure. Few of the introns demarcate regions of secondary or tertiary structure 
and further insertions, deletions and migrations of introns must have occurred. 

Substrate specificity manifested by the serpin is determined, at least in part, by 
the P1 residue of the bait loop which contains Met or Val for elastase, Lys for 
trypsin, Leu for chymotrypsin and Arg for thrombin. After gene duplication, 
divergence in terms of substrate specificity has been very rapid and has resulted 
from ‘accelerated evolution’ of the reactive center region (Hill and Hastie, 1987). 
This rapid evolution is thought to have taken place by a combination of mecha- 
nisms, from genetic drift to gene conversion to positive Darwinian selection 
(Graur and Li, 1988; Hill and Hastie, 1987; Ohta, 1994). 
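
The relationship between the P1 residue and the target protease described above amounts to a simple lookup, sketched here for illustration. The table contains only the pairings listed in the text; the function name is hypothetical, and since specificity is determined only "at least in part" by P1, the lookup should not be read as a predictor.

```python
# P1 residue of the serpin bait loop -> protease inhibited, as listed in the text.
P1_TO_PROTEASE = {
    "Met": "elastase",
    "Val": "elastase",
    "Lys": "trypsin",
    "Leu": "chymotrypsin",
    "Arg": "thrombin",
}


def target_protease(p1_residue):
    """Return the protease associated with a P1 residue, or None if the
    residue is not among those listed."""
    return P1_TO_PROTEASE.get(p1_residue)


# A single residue change at P1 is enough to switch the entry in the table,
# for example from an elastase-type to a thrombin-type reactive center.
print(target_protease("Met"), "->", target_protease("Arg"))
```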


Zinc finger genes. The term ‘zinc finger’ refers to a 28 amino acid sequence motif 
which binds zinc ions thereby stabilizing the structure of a small DNA-binding 
domain (El-Baradi and Pieler, 1991). It has been estimated that between 300 and 700 
human genes encode zinc finger-containing proteins (Hoovers et al., 1992). Most of 
these proteins belong to the Kruppel-type family and act as transcription factors (El- 
Baradi and Pieler, 1991). Zinc finger proteins often contain multiple zinc finger 
motifs, the number varying between 2 and 37 (Klug and Schwabe, 1995).

To date, some 250 human zinc finger genes have been allocated a symbol and
some have been chromosomally localized. Inspection of the available mapping
data reveals at least 9 distinct clusters of zinc finger genes in the human genome:
3p21-p22 (ZNF52, ZNF64, ZNF35, ZNF166, ZNF167, ZNF168), 6p21 (ZNF76,
ZNF165, ZNF173, ZNF204), 8q24 (ZNF7, ZNF16, ZNF34), 10p11 (ZNF11A,
ZNF25, ZNF33A, ZNF37A), 10q11 (ZNF11B, ZNF22, ZNF33B, ZNF37B),
11q23 (ZNF123, ZNF125, ZNF128, ZNF129, ZNF145), 19p12-p13 (ZNF43, 
ZNF56, ZNF58, ZNF66, ZNF67, ZNF85, ZNF90, ZNF91, ZNF92, ZNF208), 
19q13 (ZNF42, ZNF45, ZNF83, ZNF93, ZNF132, ZNF134, ZNF135, ZNF136, 
ZNF137, ZNF146, ZNF154, ZNF155, ZNF160, ZNF175), 22q11 (ZNF69, 
ZNF70, ZNF71, ZNF74). This superfamily has presumably evolved through 
cycles of gene transposition and duplication. Duplications in the ZNF91 family 
are known to have occurred some 55 Myrs ago in the common ancestor of the 
simians (Bellefroid et al., 1995). 

The human ZNF45 and ZNF93 genes are very similar to their murine ortho- 
logues although the human genes encode more zinc finger repeats than their 
murine counterparts, consistent with the occurrence of intragenic deletions/ 
duplications (Shannon and Stubbs, 1998). Similarly, the human transcription fac- 
tor MOK2 (MOK2; 19q13.2-q13.3) contains 10 zinc finger motifs in comparison 
to 7 in the murine homologue (Ernoult-Lange et al., 1995). 


Olfactory receptor genes. The olfactory system is thought to be capable of dis- 
tinguishing several thousand odorant molecules. This is potentiated by olfactory 
receptors (ORs) which are responsible for the recognition and G protein-medi- 
ated transduction of specific odorant signals. With possibly 1000 members in the 
human genome, the OR gene superfamily constitutes by far the largest family 
encoding G protein-coupled receptors. OR genes are intronless and occur in clus- 
ters that are present at more than 25 chromosomal locations in the human 
genome. However, more than 70% of human OR-homologous sequences are prob- 
ably pseudogenes (Rouquier et al., 1998; Trask et al., 1998). 

Human OR gene clusters appear to be disproportionately located in subtelom- 
eric regions and have been subject to frequent duplications and inter-chromoso- 
mal rearrangements (Trask et al., 1998) some of which appear to have been 
mediated by recombination between repetitive sequence elements (Glusman et al., 
1996). OR genes within the clusters belong to at least four different subfamilies 
which display as much sequence variability within clusters as between clusters 
(Ben-Arie et al., 1993). The classification of human OR genes is still in its infancy 
but chromosomally localized members include OR1A1, 17p13; OR1D2, OR1D4,
OR1D5, 17p13; OR1E1, OR1E2, 17p13; OR1F1, 16p13; OR1G1, 17p13; OR2D2,
11p15; OR3A1, OR3A2, OR3A3, 17p13; OR5D3, OR5D4, 11q12; OR5F1, 11q12;
OR6A1, 11p15; OR10A1, 11p15.

The olfactory system is combinatorial in that one OR can recognize multiple 
odorant molecules, and one odorant molecule may be recognized by multiple 
ORs, whilst different odorants are recognized by different combinations of ORs 
(Malnic et al., 1999). Each nasal olfactory sensory neuron expresses only one allele 
of a single OR gene (Chess et al., 1994; Sullivan et al., 1996). In the olfactory 
epithelium, different sets of ORs are expressed in distinct spatial zones; neurons 
expressing a given OR gene are located in the same zone but in that zone they are
interspersed with neurons expressing other ORs (Sullivan et al., 1996). However, 
OR genes expressed in the same zone map to numerous different loci whereas a 
single OR locus may contain genes expressed in different zones (Sullivan et al., 
1996). Since some of the OR pseudogenes have been found to be expressed, it is 
possible that some of those neurons that only express a single receptor type may 
be nonfunctional (Crowe et al., 1996). 
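
The discriminating power of a combinatorial code can be illustrated with a rough count. The sketch below is an order-of-magnitude illustration under stated assumptions, not an estimate from the cited studies: it takes the figure of possibly 1000 OR sequences, treats 70% as pseudogenes (the text says "more than 70%", so 300 is an upper bound on functional genes), and asks how many distinct sets of two or three receptors could in principle be activated. The variable and function names are hypothetical.

```python
from math import comb

TOTAL_OR_SEQUENCES = 1000      # "possibly 1000 members" (from the text)
MAX_FUNCTIONAL_PERCENT = 30    # assumption: at most 30% are not pseudogenes

# Upper bound on the number of functional receptors under these assumptions.
functional_upper_bound = TOTAL_OR_SEQUENCES * MAX_FUNCTIONAL_PERCENT // 100


def receptor_combinations(n_receptors, k):
    """Number of distinct unordered sets of k receptors drawn from n."""
    return comb(n_receptors, k)


for k in (1, 2, 3):
    count = receptor_combinations(functional_upper_bound, k)
    print(f"sets of {k} receptor(s): {count}")
```

Even if each odorant activated only two or three receptors, the number of possible combinations would far exceed the several thousand odorants that the system is thought to distinguish.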

The origin and emergence of the olfactory receptor gene family probably pre- 
ceded the divergence of the mammals (Ben-Arie et al., 1993; Issel-Tarver and 
Rine, 1997). In the primates, the olfactory receptor genes appear to have been in a 
considerable state of flux with numerous translocations, duplications and dele- 
tions occurring during the evolution of the great apes (Trask et al., 1998). The ori- 
gin of ORs is unclear but their expression during spermatogenesis and their 
presence in mature sperm cells suggests that their original function could have 
been in sperm physiology (Vanderhaeghen et al., 1997a, 1997b). 


4.2.4 Genes which undergo programmed rearrangement 


The immunoglobulin and T-cell receptor genes are members of the immunoglob- 
ulin superfamily and share the common property of being assembled from multi- 
ple gene coding segments in the lymphocyte. In the germline, the genes encoding 
the variable portions of these receptors are usually split into variable (V), J
(joining) and D (diversity) segments that are joined together by a somatic process of
site-specific recombination that involves the recombination-activating proteins
RAG1 and RAG2 (see Chapter 9, section 9.4.2) to generate exons that encode the
antigen-binding portion of the polypeptide. 


Immunoglobulin genes. A diverse repertoire of human antibodies is generated 
by the combinatorial somatic rearrangement of a relatively small number of gene 
segments, variable (VH), diversity (D) and joining (JH) segments for the heavy
chain V region, and variable (VL) and joining (JL) segments for the light chain V
region. The JH and JL segments are spliced to the constant region genes of the
heavy (CH) and light (CL) chains following mRNA transcription.

The human immunoglobulin genes are mainly distributed between three
chromosomal locations: 14q32 for the heavy chain loci, 2p12 for the κ light chain loci
and 22q11 for the λ light chain loci. The 1100 kb human heavy chain locus
contains 51 functional VH gene (IGHV) segments, ~30 D segments (IGHDY), 6
functional JH segments (IGHJ) and 9 functional CH genes: μ (IGHM), δ (IGHD), γ3
(IGHG3), γ1 (IGHG1), α1 (IGHA1), γ2 (IGHG2), γ4 (IGHG4), ε1 (IGHE) and α2
(IGHA2) (Figure 4.29; Cook and Tomlinson, 1995). The human CH gene locus
clearly exhibits dyadic symmetry as a result of a duplication event that included
the Cγ-Cγ-Cε-Cα gene cluster in the ancestor of the great apes (Kawamura and
Ueda, 1992). The symmetry is imperfect in extant species because the Cε gene was
lost through independent deletion events in humans, chimpanzees, gorilla and
the white-handed gibbon (Hylobates lar) whereas in the orangutan, the Cα gene
was deleted as well as the Cε gene (Kawamura and Ueda, 1992).

Figure 4.29. Schematic organization of the human immunoglobulin heavy chain (IGHV)
locus.


The κ light chain locus contains 76 Vκ genes (IGKV) and pseudogenes, 5
functional Jκ segments (IGKJ) and a single Cκ gene (IGKC). This locus consists of
two copies of a DNA region arranged with opposite polarity (Weichhold et al.,
1993), the result of an intra-locus duplication which has occurred since the
divergence of human and chimpanzee (Ermert et al., 1995). The λ light chain locus
contains 36 functional Vλ genes (IGLV), 33 Vλ pseudogenes, 34 Vλ ‘relics’
containing large deletions or insertions, 6 Jλ segments (IGLJ) and 6 Cλ genes
(IGLC) (Kawasaki et al., 1997). In addition to the above, 8 VH segments (IGHV2)
and a cluster of D segments (IGHDY2) have been identified on chromosome
15q11.2 with a further 16 VH segments (IGHV3) on chromosome 16p11.2
(Tomlinson et al., 1994). These ‘orphon’ VH sequences are thought to have been
translocated to chromosomes 15 and 16 some 20 Myrs ago (Matsuda and Honjo,
1996) and ~40% may be functional. A small cluster of orphon Vκ sequences has
also been found on chromosome 22, and flanking direct and inverted repeats have
been invoked to account for their transposition (Borden et al., 1990).
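
The combinatorial arithmetic implied by these segment counts can be sketched as follows. This is a minimal illustration only. It uses the counts quoted above (51 functional VH, roughly 30 D and 6 functional JH segments; 36 functional Vλ and 6 Jλ segments), assumes that every segment can be joined to every other and that all six Jλ segments are usable, and ignores junctional diversity and somatic mutation. The κ locus is omitted because the number of functional Vκ genes is not given here. The names used are hypothetical.

```python
from math import prod

# Segment counts quoted in the text; the D count is approximate (~30).
HEAVY_SEGMENTS = {"V": 51, "D": 30, "J": 6}
# Assumption: all 6 J-lambda segments are usable.
LAMBDA_SEGMENTS = {"V": 36, "J": 6}


def segment_combinations(segments):
    """Number of ways of choosing one segment of each type."""
    return prod(segments.values())


heavy = segment_combinations(HEAVY_SEGMENTS)
lambda_light = segment_combinations(LAMBDA_SEGMENTS)
# Assumption: any heavy chain can pair with any lambda light chain.
paired = heavy * lambda_light

print(f"heavy chain V-D-J combinations:  {heavy}")         # 9180
print(f"lambda chain V-J combinations:   {lambda_light}")  # 216
print(f"heavy/lambda pairings:           {paired}")        # 1982880
```

On these assumptions, fewer than 130 germline gene segments suffice to specify nearly two million heavy/λ chain pairings before any junctional or mutational diversity is taken into account.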

It has been apparent for some time that the heavy and light immunoglobulin 
chains are homologous and that these proteins must have been assembled by 
extensive duplication of a short ancestral 100 amino acid polypeptide chain (Hill 
et al., 1966). This domain, the immunoglobulin fold, has now been found in a very
wide array of different proteins throughout the animal kingdom. The 
immunoglobulins of the immune system appear however to be a vertebrate cre- 
ation (Schluter et al., 1997). 

The immunoglobulin genes are thought to have originated very early in the 
evolution of the jawed vertebrates (Rast et al., 1997). Phylogenetic analyses of the 
human VH heavy chain genes (Haino et al., 1994; Vargas-Madrazo et al., 1997), CH
genes (Takahashi et al., 1982), Vλ light chain genes and pseudogenes (Kawasaki et
al., 1997) and Vκ light chain genes (Kurth et al., 1993; Sitnikova and Nei, 1998;
Sitnikova and Su, 1998; Vargas-Madrazo et al., 1997) are consistent with a history
of multiple, successive duplication events, both extensive and of more limited 
extent, during the evolution of all branches of the immunoglobulin gene family. 
This mode of evolution is reminiscent of a ‘birth and death process’ rather than 
any mechanism of concerted evolution (Section 4.2.1). Some immunoglobulin 
genes have been present in vertebrate genomes for >400 Myrs whereas others are
more recent creations that have emerged during the mammalian radiation 
(reviewed by Andersson and Matsunaga, 1995; Matsuda and Honjo, 1996; 
Schluter et al., 1997). The expansion of the immunoglobulin gene family was 
almost certainly an adaptive response to the wide range of novel pathogens that
challenged early tetrapods in their new terrestrial environment. 

The evolution of the immunoglobulin genes is subject to at least two distinct 
types of constraint: existing gene segments must be conserved in order to meet 
recurring immunological challenges but diversity must be continually created to 
meet new immunological challenges. The duplication of gene segments provides 
a means to achieve this by ensuring that old structures/functions are maintained 
at the same time as new ones are being created. 


T-cell receptor genes. The T-cell receptor plays an important role in antigen
recognition during the immune response. The four T-cell receptor chains (α, β, γ,
and δ), members of the immunoglobulin superfamily, dimerize to form two types
of receptor (α/β, γ/δ). In the human genome, these chains are encoded by four
genes: TCRA and TCRD (14q11.2), TCRB (7q35), and TCRG (7p14-p15). The
location of the TCRD gene is highly unusual in that it lies within the TCRA gene:
the multiple gene units are organized as Vα-Vδ-Dδ-Jδ-Cδ-Jα-Cα (Hockett et al.,
1988). Each locus comprises a large number of tandemly arrayed variable (V) 
genes, diversity (D) segments (TCRB and TCRD), a number of clustered joining 
(J) elements and one or two constant (C) region genes. These gene segments 
undergo somatic rearrangement during lymphocyte differentiation to yield either 
V/J or V/D/J genes. As with the immunoglobulin genes (Section 4.2.4,
Immunoglobulin genes), the T-cell receptor genes probably originated very
early in the evolution of the jawed vertebrates (Rast et al., 1997). The tandem 
arrays of V genes seen in extant vertebrate genomes have arisen subsequently by 
serial duplication, generating the sizeable repertoire of T-cell receptor V genes 
essential for creating diversity of antigen recognition specificity. 

The TCRB gene has been the best characterized of the four loci with the 
sequencing of 685 kb from the region (Rowen et al., 1996). This region contains 46 
variable gene segments, 19 pseudogenes and two clusters of D, J, and C segments. 
A further 22 additional sequences, termed ‘relics’ by the authors, were also iden- 
tified. These sequences exhibited limited homology to V gene segments and rep- 
resent partial pseudogenes extensively altered by insertions and deletions. Some 
30% of the TCRB locus is composed of genome-wide interspersed repeats but 
these sequences do not appear to have facilitated the duplication of locus-specific 
repeats. Some of these repeats within the TCRB locus have been shown to harbor 
trypsinogen genes (PRSS1, PRSS2). These genes must have been conserved for 
at least 350 Myrs because a trypsinogen gene cluster is present at this location in 
both mouse and chicken. Higher primates contain similar numbers of V gene seg- 
ments at the TCRB locus: human (47), gorilla (45), orangutan (57), macaque (57) 
with the greater number of V genes in orangutan and macaques being due to 
amplification of the TCRBV7, 9 and 23 subfamilies (Charmley et al., 1995). 


4.3 Convergent evolution 


Most members of the gene and protein families discussed in Section 4.2 have been 
subject to divergent evolution, the process by which homologous proteins or 
domains develop from a common origin but gradually acquire their own unique
identity in terms of structure and function. By contrast, convergent evolution refers 
to the similarity between two protein structures, amino acid sequences or 
nucleotide sequences due to their independent evolution from different origins 
rather than through their possession of a common ancestor. Evolutionary conver- 
gence therefore implies ‘adaptive change in which lesser related entities come to 
appear more related than they are’ (Doolittle, 1994). The term convergence has 
however been used in many different contexts with very different meanings. 
Doolittle (1994) distinguished between functional convergence, mechanistic con- 
vergence and structural convergence. 

Functional convergence is exemplified by various pairs of enzymes that have 
evolved independently to catalyse the same biochemical reactions, for example 
superoxide dismutases, aldolases, alcohol dehydrogenases and topoisomerases 
(Doolittle, 1994). Myoglobin from the abalone Sulculus diversicolor exhibits func- 
tional convergence with vertebrate myoglobin although it is not in any way 
homologous to it (Suzuki et al., 1996). Instead, the Sulculus myoglobin gene is evo- 
lutionarily related to the human indoleamine 2,3-dioxygenase (IDO; 8p11-p12) 
gene (Suzuki et al., 1996). The original Sulculus myoglobin gene may have been 
lost and a modified Ido gene could have evolved as a substitute. 

Perhaps the best example of mechanistic convergence is provided by the serine 
proteases subtilisin and chymotrypsin which, although unrelated evolutionarily, 
have independently evolved similar enzymatic mechanisms. Chymotrypsin has a 
catalytic triad with a histidine at residue 57, an aspartate at residue 102 and a ser- 
ine at residue 195. The bacterial protein subtilisin possesses a completely differ- 
ent structure but also has a catalytic triad comprising an aspartate at residue 32, a 
histidine at residue 64 and a serine at residue 221. The human genome contains 
representatives of both families. Thus, the chymotrypsinogen B (CTRB1; 16q23), 
chymotrypsin-like protease (CTRL; 16q22), trypsin 1 (PRSS1; 7q35), trypsin 2 
(PRSS2; 7q35), elastase 1 (ELA1; 12q13), plasminogen (PLG; 6q26) and
prothrombin (F2; 11p11-q12) genes are members of the chymotrypsin family of
serine proteases. Human genes encoding subtilisin-like serine proteases include the
proprotein convertases (PCSK1, 5q15-q21; PCSK2, 20p11; PCSK4, chromosome
19; PCSK5, 9q21.3), furin (PACE; 15q25-q26) and paired basic amino acid
cleaving enzyme 4 (PACE4; 15q26). Another example of mechanistic convergence is
provided by the cold shock domain protein family and the family of RNA-bind- 
ing proteins that contain an RNA-binding domain. Both of these domains con- 
tain conserved ribonucleoprotein motifs on similar single stranded nucleic 
acid-binding surfaces (Graumann and Marahiel, 1996). 
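
The point about the two catalytic triads can be made concrete by comparing the order in which the three residues occur along each polypeptide chain. The following is a minimal sketch that uses only the residue numbers quoted above; the function name `triad_order` is an illustrative choice and not part of any published tool.

```python
# Catalytic triad residues and their positions, as quoted in the text.
CHYMOTRYPSIN = {"His": 57, "Asp": 102, "Ser": 195}
SUBTILISIN = {"Asp": 32, "His": 64, "Ser": 221}


def triad_order(triad):
    """Return the triad residues sorted by position along the polypeptide."""
    return [residue for residue, _ in sorted(triad.items(), key=lambda kv: kv[1])]


if __name__ == "__main__":
    print("chymotrypsin:", "-".join(triad_order(CHYMOTRYPSIN)))
    print("subtilisin:  ", "-".join(triad_order(SUBTILISIN)))
    # The same three residue types are used, but in a different linear order.
    print("same residues:", set(CHYMOTRYPSIN) == set(SUBTILISIN))
    print("same order:   ", triad_order(CHYMOTRYPSIN) == triad_order(SUBTILISIN))
```

The two enzymes employ the same three residue types in a different linear order, which is what would be expected if the triad had been arrived at independently in two unrelated protein folds.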

Structural convergence, on the other hand, may reflect the tendency of specific 
amino acid sequences to fold into certain favored conformations. Thus,
structurally dissimilar families of transport proteins have been found to exhibit
similar structural units consisting of six tightly packed α-helices which may comprise
all or part of a transmembrane channel (Saier, 1994). 

In none of the above examples is there any evidence for sequence convergence. For 
two sequences to be shown to display convergence in this strict sense, they would 
not only have to be shown to be evolutionarily unrelated but chance would also 
have to be excluded as a reason for their similarity. Indeed, sequence convergence 
would imply the occurrence of adaptive amino acid replacements that have been
positively selected during evolution. Although sequence convergence has been 
invoked in many different situations, there are still no convincing examples of 
unrelated sequences which adhere to these criteria such that they warrant the use 
of the term sequence convergence. 


4.4 Coevolution 


Coevolution can be said to be operating in cases where the process of evolutionary 
change experienced at one locus is influenced by changes that have occurred at 
another locus. At its simplest, coevolution can occur when two genes are
intimately associated as in the case of shared bidirectional promoter elements
(Chapter 5, section 5.1.5). More complex situations involve unlinked genes. 
Fryxell (1996) proposed that ‘the acquisition of a novel function by a duplicated 
gene could be facilitated by pre-existing heterogeneity in proteins that interact 
directly with the product of the duplicated gene’. Thus, the duplication and func- 
tional divergence of one gene might serve to create an altered genetic environ- 
ment that could promote the divergence of duplicate copies of genes encoding 
proteins that interact with the protein products of the first gene. 

In a study of the interspecies diversity manifested by a series of 48
ligand-receptor pairs, Murphy (1993) demonstrated a linear relationship between
receptor divergence and ligand divergence. Interestingly, the inter-species differ- 
ences in receptor structure were nonrandomly distributed and largely confined to 
the extracellular domains that interact with the ligand. Since this study employed 
ligand-receptor pairs from diverse systems (host defence proteins, neurotransmit- 
ters, hormones, growth factors and cell adhesion proteins), it is reasonable to sup- 
pose that the coevolution of genes encoding interacting proteins is not an 
uncommon phenomenon. 
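
A relationship of this kind can be examined by regressing receptor divergence on ligand divergence across a set of pairs. The sketch below is illustrative only: the divergence values are invented for demonstration and are not the data of Murphy (1993), and the function names `fit_line` and `pearson_r` are assumptions of the example.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def pearson_r(xs, ys):
    """Pearson correlation coefficient between xs and ys."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / (sxx * syy) ** 0.5


# HYPOTHETICAL values: percent amino acid divergence between two species
# for five invented ligand-receptor pairs (ligand, receptor).
HYPOTHETICAL_PAIRS = [(2.0, 3.0), (5.0, 6.0), (10.0, 12.0), (20.0, 19.0), (35.0, 36.0)]

if __name__ == "__main__":
    ligand = [pair[0] for pair in HYPOTHETICAL_PAIRS]
    receptor = [pair[1] for pair in HYPOTHETICAL_PAIRS]
    slope, intercept = fit_line(ligand, receptor)
    print(f"receptor = {slope:.2f} * ligand + {intercept:.2f}")
    print(f"r = {pearson_r(ligand, receptor):.3f}")
```

A correlation coefficient close to 1 would be consistent with coevolution of the two partners, although it could not by itself exclude a shared underlying substitution rate as the explanation.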

The coevolution of families of receptors and their ligands is perhaps best exem- 
plified by the insulin-nerve growth factor family and their receptors (Section 4.2.3, 
Insulin and insulin-like growth factor genes; Fryxell, 1996). These ligand-receptor 
pairs include insulin (INS; 11p15.5) and insulin receptor (INSR; 19p13),
insulin-like growth factor 1 (IGF1; 12q22-q23) and its receptor (IGF1R; 15q25-qter),
brain-derived neurotrophic factor (BDNF; 11p13) and neurotrophin 5 (NTF5; 
19q13.3) and their cognate receptor neurotrophic tyrosine kinase receptor type 2 
(NTRK2; 9q22.1), neurotrophin 3 (NTF3; 12p13) and neurotrophic tyrosine 
kinase receptor type 3 (NTRK3; 15q25), nerve growth factor (NGFB; 1p13.1) and 
neurotrophic tyrosine kinase receptor type 1 (NTRK1; 1q21-q22). The ligand- 
encoding genes share a common ancestry as do the genes encoding their receptors 
(Fryxell, 1996). As new trophic factors emerged by duplication and divergence, so 
their cognate receptors evolved by a similar parallel process. Other examples of the 
coevolution of ligands and their receptors include interleukin 8 (IL8; 4q13-q21)
and its receptors (IL8RA, IL8RB; 2q35; Ahuja et al., 1992), interleukin 4 (IL4;
5q23-q31) and interleukin 4 receptor (IL4R; 16p11-p12; Richter et al., 1995), the
gonadotropins and their receptors (Moyle et al., 1994) and the nuclear receptors 
and their ligands (Escriva et al., 1997; section 4.2.3, Nuclear receptor genes). 

Coevolution occurs between genes encoding different subunits of
heterodimeric proteins, for example the myc proto-oncogene (MYC; 8q24) and its
dimerization partner max (MAX; 14q23) (Atchley and Fitch 1995), the genes
encoding the type I and II keratins (Section 4.2.1, Keratin genes) and the genes
encoding the α- and β-integrin chains (Hughes 1992; Section 4.2.1, Integrin genes).
Coevolution can also occur between agonists and antagonists, for example
interleukins 1α (IL1A; 2q13) and 1β (IL1B; 2q13-q21), and the interleukin 1 receptor
antagonist (IL1RN; 2q14.2) which binds to the IL1 receptor and blocks IL1α and
IL1β binding without inducing a signal of its own; all three genes share a
common ancestry (Eisenberg et al., 1991). The cytokines and their receptors have both
arisen by a process of gene duplication and divergence and although the relative 
timing of these duplicative events is still unclear, evidence is emerging for ligand- 
receptor coevolution (He and Wu, 1993; Kosugi et al., 1995; Shields et al., 1995). 

One prediction of Fryxell’s (1996) hypothesis is that duplication/divergence 
events in functionally related gene families may be temporally correlated. This 
prediction appears to be borne out at least for the α- and β-integrin chain genes
(Hughes, 1992) and the fibroblast growth factor/fibroblast growth factor receptor 
genes (Coulier et al., 1997). 

Whilst numerous examples of ‘ligand promiscuity’ have been recognized, e.g.
the type I interferon genes clustered on 9p21 (see Section 4.2.3, Interferon genes),
‘receptor promiscuity’ appears to be much rarer (Ahuja et al., 1992). Examples of
receptor promiscuity include the interleukin 8 receptors (IL8RA, IL8RB; 2q35;
Ahuja et al., 1992) and the interferon receptors α, β, and ω, 1 (IFNAR1) and 2
(IFNAR2) and interferon receptor γ2 (IFNGR2) whose genes are closely linked to
each other on 21q22.1. 

The coevolution of ligand-receptor pairs by parallel pathways of gene duplica- 
tion and functional divergence has probably been facilitated in two quite distinct 
ways. Firstly, the close linkage of multiple genes encoding either ligands or their 
receptors will have served to promote gene duplication. Secondly, whole genome 
duplications early on in vertebrate evolution (Chapter 2, section 2.1) may have, by 
simultaneously increasing both ligand and ligand receptor diversity, provided the 
raw material for selection to recruit novel receptor-ligand interactions thereby 
potentiating the dramatic increase in the biochemical and physiological complex- 
ity characteristic of the vertebrates. 

Promoters and transcription factors


The eukaryotic transcriptional apparatus is much more complex than that of 
prokaryotes and this complexity is bound up with the fact that many eukaryotic 
genes are silenced by being tightly packed in chromatin (Zuckerkandl, 1997). 
Transcriptional repression was a necessary pre-condition for the development of 
large genomes since increasing gene number required gene expression to be 
restricted both spatially and temporally. That nucleosomal packing and tran- 
scriptional repression have, in eukaryotes, long been associated with each other is 
evidenced by the discovery that the core histone fold is not confined to the his- 
tones but has also turned up in a number of transcriptional coactivators and 
repressors (Ouzounis and Kyrpides, 1996a). This ancient association is also evi- 
dent in the High Mobility Group (HMG) superfamily (Section 5.2) which may be 
divided into two distinct subfamilies comprising transcription factors and chro- 
matin structure regulatory proteins, respectively. 

The evolution of cellular processes probably involved several distinct stages: 
metabolism and translation preceded transcription whereas transcription must 
have preceded transcriptional repression by chromosomal packing at the dawn of 
the eukaryotic era. Genome expansion was potentiated by chromosomal packing 
whilst gene regulation became ever more complex as new transcription factors 
emerged (Ouzounis and Kyrpides, 1996b). The last universal ancestor probably 
possessed basic molecular components of metabolism and translation while hav- 
ing a prokaryote-like genome organization together with a transcriptional system 
reminiscent of the archaea. 


5.1 Promoters and enhancers 


The basic mechanism of transcriptional activation has been conserved across the 
spectrum of eukaryotes from yeast to human (Schena, 1989). It is therefore not 
surprising that two sequence elements, the TATAAA and CCAAT boxes which 
serve to coordinate the basal transcriptional machinery, are found in all eukary- 
otes. Indeed, these motifs occur in the promoters of a wide variety of RNA poly- 
merase II-transcribed genes. Similarly, the initiator element, a functional 
analogue of the TATA box, appears to be ubiquitous in eukaryotes (Liston and 
Johnson, 1999). Other sequence motifs also occur, in different combinations and 
permutations, thereby giving eukaryotic promoters their modular character
(Mushegian and Koonin, 1996; reviewed by Nussinov, 1990). 

Transcriptional activators bind to promoter sequences in different combina- 
tions and permutations, some factors being coordinated by cis-acting DNA 
sequence motifs, others by protein-protein interactions on the promoter. The 
effect of any given activator is therefore likely to depend on the other activator 
and repressor proteins present which may bind either competitively or coopera- 
tively. Such a process of ‘combinatorial control’ has, by maximizing flexibility, 
potentiated the rapid evolution of quite elaborate gene regulatory networks 
(Ptashne, 1997). Those seeking a reference source for cis-acting DNA sequence
motifs may consult the Transcription Regulatory Regions Database
(http://www.bionet.nsc.ru/trrd) or the Eukaryotic Promoter Database
(http://www.epd.isb-sib.ch), both of which contain information on regulatory regions
of eukaryotic genes and include data on transcription factor binding sites, pro- 
moters, enhancers and silencers. 

If we take the simple case of two genes recently duplicated at the genomic DNA 
level, then we might reasonably expect that their promoter regions would initially 
be very similar. As a consequence, the expression profiles of these genes might 
also be expected to be very similar if not identical. However, gene duplication 
generates redundancy allowing the promoters to be freed from the constraints of 
selection thereby enabling them to acquire mutations. Such mutations can lead to 
the inactivation of the gene by abolishing its expression (see Chapter 6) or alter- 
natively can, in rather more subtle ways, serve to change its expression pattern. 

In principle, promoters can change either by the slow steady acquisition of sin- 
gle base-pair substitutions in pre-existing sequence motifs or, more dramatically 
and abruptly, by promoter shuffling, the gain or loss of individual regulatory ele- 
ments exchanged between genes in cassette fashion (Surguchov, 1991). Promoter 
shuffling may have occurred in cases of homologous genes that contain dissimilar 
upstream regulatory elements but also in cases of nonhomologous genes contain- 
ing similar upstream regulatory elements. For promoter shuffling to be viable, the 
transposed sequence element must be capable of exerting an influence on the 
transcription of the gene that has just acquired it. That this proposition is a rea- 
sonable one is evidenced by the results of studies reported by Kermekchiev et al. 
(1991). These authors tested 27 combinations of different promoters and 
enhancers and found that the relative in vitro efficiency of the enhancers was 
roughly the same irrespective of the promoter used. Another way in which pro- 
moter sequences can change abruptly is through gene conversion as described for 
the human growth hormone (GH1; 17q22-q24; Giordano et al., 1997) and y-globin 
(HBG1, HBG2; 11p15.5; Chiu et al., 1997) gene promoters. 

In subsequent sections, the similarities and differences between the promoters 
of extant mammalian genes will be explored. In addition, some of the mechanisms 
by which evolution has recruited specific sequences to a promoter function and 
fine-tuned their interactions with transcription factors will be described. 


5.1.1 Evolutionary conservation of cis-acting elements and ‘phylogenetic footprinting’


Sequence conservation in promoter regions is usually held to be an indicator of 
functional importance and may therefore be used as a rough guide to the location 
of cis-acting regulatory elements that might bind trans-acting factors. Such phylo- 
genetic footprinting has proven successful in locating trans-acting factor binding 
sites in a considerable number of human genes including the ε-globin (HBE1;
11p15.5) gene (Gumucio et al., 1993; Hardison et al., 1997), the γ1-globin (HBG1;
11p15.5) gene (Tagle et al., 1988), the β-globin (HBB; 11p15.5) gene locus control
region (Shelton et al., 1997), the cytochrome c oxidase subunit Vb (COX5B;
2cen-q13) gene (Bachman et al., 1996), the Duchenne muscular dystrophy (DMD;
Xp21) gene (Fracasso and Patarnello 1998), the hormone-sensitive lipase (LIPE;
19q13) gene (Talmud et al., 1998) and the cystic fibrosis transmembrane
conductance regulator (CFTR; 7q31.3) gene (Vuillaumier et al., 1997) among others. It
was also instrumental in identifying the CArG sequence motif [CC(A/T rich)6GG]
important for the regulation of the cardiac α-actin (ACTC; 15q14) gene (Taylor
et al., 1988).
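
Once a motif such as the CArG box has been defined, candidate sites can be located in a promoter sequence by simple pattern matching. The sketch below assumes that the motif is read as CC, followed by six A or T bases, followed by GG; both the repeat count of six and the restriction to strictly A or T at those positions are simplifying assumptions of the example. The test sequence is invented.

```python
import re

# Assumed reading of the CArG motif: CC, then six A or T bases, then GG.
# A lookahead is used so that overlapping sites are all reported.
# This pattern is its own reverse complement, so a scan of one strand
# also covers sites on the opposite strand.
CARG = re.compile(r"(?=(CC[AT]{6}GG))")


def find_carg(seq):
    """Return (0-based start, matched site) for every CArG-like site in seq."""
    return [(m.start(), m.group(1)) for m in CARG.finditer(seq.upper())]


if __name__ == "__main__":
    toy_promoter = "ggaCCTTATATGGca"  # invented sequence, not a real promoter
    for start, site in find_carg(toy_promoter):
        print(start, site)
```

A match of this kind identifies only a candidate element; whether it binds a trans-acting factor must still be established experimentally.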

Phylogenetic footprinting has been used to identify regulatory elements in the 
promoter regions of the rapidly evolving SRY (Yp11.3) genes of 10 mammalian 
species (Margarit et al., 1998). A total of 10 putative regulatory elements were 
identified by these means (Figure 5.1). Interestingly, differences were apparent 
between the SRY promoters not only in terms of the presence/absence of specific 
motifs and motif copy number, but also in the spacing between motifs, their rela- 
tive location and orientation (sense or antisense strand). Conserved elements in 
the SRY gene promoters within each taxonomic group (primates, bovids, rodents) 
did however tend to occupy orthologous positions. 

The study of Vuillaumier et al. (1997) identified eleven DNase I-hypersensitive 
phylogenetic footprints in a 3.5 kb region of the CFTR gene (7q31.3) promoter 
when eight mammals from four different orders were compared. Two of these 
footprints, which corresponded to the cAMP response element and the PMA- 
responsive element, were conserved in all species examined, a finding which 
probably reflects the vital importance of transcriptional control through the 
cAMP and diacylglycerol pathways respectively. By contrast, a 300 bp
purine.pyrimidine stretch, thought to represent a negative element of basal tran- 
scription, was found to be present only in the Cftr genes of rodents. Studies of the 
globin genes have also shown that some motifs may be of functional importance 
in one species but not be conserved between species, for example two Sp1-binding 
sites in the human HBB gene locus control region (Shelton et al., 1997). Such 
inter-specific differences are explored further in Section 5.1.4. 
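
The computational core of phylogenetic footprinting, the search for blocks of columns that are invariant across an alignment of orthologous promoters, can be sketched as follows. This is a deliberately minimal illustration, not the procedure used in any of the studies cited: the three aligned sequences are invented, the function name `conserved_blocks` and the `min_len` threshold are assumptions, and real analyses would tolerate some mismatches and would weight species by their evolutionary distance.

```python
def conserved_blocks(aligned, min_len=6):
    """Return (start, end, sequence) for runs of columns identical in all rows.

    `aligned` is a list of equal-length, pre-aligned sequences. Gap
    characters ('-') never count as conserved. `end` is exclusive.
    """
    length = len(aligned[0])
    if any(len(seq) != length for seq in aligned):
        raise ValueError("sequences must be aligned to the same length")
    blocks, start = [], None
    # Iterate one position past the end so that a final run is closed.
    for i in range(length + 1):
        conserved = (
            i < length
            and aligned[0][i] != "-"
            and all(seq[i] == aligned[0][i] for seq in aligned)
        )
        if conserved and start is None:
            start = i
        elif not conserved and start is not None:
            if i - start >= min_len:
                blocks.append((start, i, aligned[0][start:i]))
            start = None
    return blocks


# INVENTED toy alignment of three hypothetical orthologous promoter fragments.
TOY_ALIGNMENT = [
    "ACGTTGACGTCAGGCTA",
    "TCGATGACGTCAAGC-A",
    "ACCTTGACGTCATGCTT",
]

if __name__ == "__main__":
    for start, end, block in conserved_blocks(TOY_ALIGNMENT):
        print(f"columns {start}-{end - 1}: {block}")
```

In the toy data the only block that passes the length threshold is an eight-base element shared by all three sequences; shorter runs of identity are discarded as likely to have arisen by chance.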

A variation on this theme is differential phylogenetic footprinting which utilizes
the promoter sequences of two extant species representing the most closely
related lineages between which a difference in developmental expression pattern
can be detected. An example of its successful use was in the study of the galago
and human γ-globin (HBG1) gene promoters (Gumucio et al., 1994; Section
5.1.8). Differential phylogenetic footprinting has also been used to study the
promoters of the apolipoprotein AI (APOA1; 11q23.3) genes of human and African
green monkey in order to assess the functional significance of species-specific
differences in expression. Sorci-Thomas and Kearns (1995) identified seven sites
within the proximal promoter region of the APOA1 gene which differed between
human and African green monkey. These authors then tested the functional
significance of these sequence changes by mutating the human promoter to the
simian sequence and testing promoter strength by reporter gene expression assay.
Substitutions at three of the sites (-189, -144 and -48 relative to the
transcriptional initiation site) were found individually to increase the activity of the
wild-type human promoter to ~60-65% of that of the African green monkey. In
addition, two double mutations (-144/-48 and -189/-144) restored promoter
activity to the same level as found in the monkey. We may therefore infer
that several substitutions in the APOA1 gene must have occurred during primate
evolution which together served to determine the specific level of APOA1 gene
transcription in the different species.

Figure 5.1. Comparison of the promoter sequences from the SRY genes of ten mammals
(redrawn from Margarit et al., 1998). Putative transcription factor binding sites are
denoted by boxes above or below the sequence indicating their position on the sense or
antisense strands respectively. Solid circles indicate gap positions present in the sheep
and gazelle sequences as compared with the bull sequence. GATA; CMYB; NF1;
BARBIE; vMYB; OCT1; SP1; AP1; SRY; GFI1.

Even if promoter elements are conserved, their relative location may not be. 
Thus the presence of four, perhaps five, DNase I hypersensitive sites in the β-globin
gene locus control region is conserved between placental mammals, marsupials
and monotremes (Hardison et al., 1997) but the spacing between the elements dif- 
fers between species. 


5.1.2 Nonhomologous genes containing similar regulatory elements 


Some cis-acting elements are common to a wide variety of different genes and may 
be evolutionarily very ancient. Thus, the TGTGACGTCTTTCAGA cAMP- 
responsive element in the promoter of the human vasoactive intestinal polypep- 
tide (VIP; 6q26-q27) gene is similar not only to elements found in the promoters 
of nonorthologous murine and avian genes but also to sequences described in 
yeast and adenoviral promoters and to the E. coli consensus sequence recognized 
by the cAMP receptor protein (Lin and Green, 1989). 

Even if we confine ourselves to the same species, completely unrelated genes 
often also possess similar upstream regulatory elements. Thus, the promoters of 
the human E-selectin (SELE; 1q22-q25) and β-interferon (IFNB1; 9p22) genes
both contain NFκB, ATF-2 and HMG I(Y) binding sites (Whitley et al., 1994).
Other examples of promoter similarity include the serum response elements
present in the regulatory regions of the human apolipoprotein E (APOE; 19q13.2),
c-fos (FOS; 14q24.3), and β-actin (ACTB; 7p12-p15) genes or the
AGGCGGCCCTTT motif in the apolipoprotein B (APOB; 2p24), apolipoprotein
CIII (APOC3; 11q23-qter) and α1-antitrypsin (PI; 14q32.1) genes (Surguchov,
1991). It is as yet unclear if these elements have evolved by the slow fine tuning of 
existing sequences or if they have instead been introduced by promoter shuffling. 


5.1.3 Paralogous genes containing dissimilar regulatory elements 


Evolutionarily related genes in the same organism often differ in their expression 
profiles. Thus, the various paralogous members of the family of murine 
cytochrome P450-16 (testosterone 16α-hydroxylase) genes are evolutionarily
related yet are regulated quite differently (Wong et al., 1989). Similarly, the
chromosomally unlinked human annexin I (ANX1; 9q11-q22), VI (ANX6; 5q32-q34),
and VII (ANX7; 10q21.1-q21.2) genes are evolutionarily related and encode pro- 
teins that are structurally and biochemically very similar. However, these genes 
possess very different sets of promoter and enhancer elements which are presum- 
ably responsible for their distinct patterns of tissue expression (Donnelly and 
Moss, 1998; Shirvan et al., 1994). Can we relate differences in expression profile of 
paralogous genes to changes in upstream regulatory elements? 

A number of studies of paralogous gene promoters have been performed and 
differences in promoter sequence between members of the same multigene family 
have indeed been found that may account for differences in transcriptional effi- 
ciency. Some paralogous promoters differ with respect to the presence or absence 
of specific regulatory elements. Others differ in terms of more subtle single base- 
pair substitutions introduced into specific cis-acting motifs which serve to 
increase or decrease the binding affinity of their cognate transcription factors. 
Such changes in promoter structure have contributed to post-duplicational gene 
diversification by providing the divergent gene products with their own specific 
expression profiles. Steroid hormone receptor responsive elements (Chapter 7,
section 7.5.2, The DNA-binding specificity of steroid receptors) are a classic example 
of this but a number of other studies of paralogous gene promoters have been 
reported which serve to illustrate the basic principles. 

The murine immunoglobulin variable region heavy chain (VH) genes may be
grouped into 15 families on the basis of coding sequence homology, and the
promoter sequences are conserved in a family-specific manner (Buchanan et al.,
1997). However, these paralogous VH gene promoters vary in terms of their
transcriptional strength, at least in vitro, by as much as 60 to 70-fold. This variation
appears to be due to several specific sequence differences between these promot- 
ers: (i) the presence/absence of a TATA box, (ii) the presence/absence of initiator 
elements, (iii) the extent of divergence of the octamer sequence element from its 
consensus (ATGCAAAT), and (iv) the spacing between the octamer motif and the 
heptamer motif located between 2 bp and 20 bp upstream (Buchanan et al., 1997). 
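
The third of these criteria, divergence of the octamer from its consensus, lends itself to a simple numerical measure: the number of positions at which a candidate site differs from ATGCAAAT. The sketch below is an illustration under that assumption; the function names and the test sequence are invented, and mismatch counting is only one of several possible ways of scoring divergence.

```python
OCTAMER_CONSENSUS = "ATGCAAAT"  # consensus sequence quoted in the text


def mismatches(site, consensus=OCTAMER_CONSENSUS):
    """Count positions at which `site` differs from the consensus."""
    if len(site) != len(consensus):
        raise ValueError("site and consensus must have the same length")
    return sum(a != b for a, b in zip(site.upper(), consensus))


def best_octamer(promoter):
    """Return (mismatches, start, site) for the window closest to the consensus.

    Ties are resolved in favour of the window nearest the start of `promoter`.
    Raises ValueError if `promoter` is shorter than the consensus.
    """
    k = len(OCTAMER_CONSENSUS)
    windows = (
        (mismatches(promoter[i:i + k]), i, promoter[i:i + k].upper())
        for i in range(len(promoter) - k + 1)
    )
    return min(windows)


if __name__ == "__main__":
    toy_promoter = "ccATGCAAAGtt"  # invented sequence with one mismatch
    print(best_octamer(toy_promoter))
```

Promoters could then be ranked by this score and the ranking compared with measured transcriptional strength, bearing in mind that the other three features listed above also contribute.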

The human genome contains a cluster of 11 pregnancy-specific glycoprotein
(PSG) genes on the long arm of chromosome 19 (PSG1, PSG2, PSG3, PSG4,
PSG5, PSG6, PSG7, PSG8, PSG11, PSG12, PSG13). These genes are highly
homologous and this homology extends to their promoter regions. The promoters
of six PSG genes have been characterized and display differences based upon the
minimal promoter required for optimal expression (Chamberlin et al., 1994). The
class 1 PSG genes require nucleotides -172 to -34 (relative to the transcriptional
initiation site) whereas the comparable minimal region of class 2 PSG genes lies
between -172 and -80. This difference has been attributed to the presence of an
imperfect SP1 binding site between -148 and -141 in the class 1 promoters. The
perfect SP1 element (CCCCGCCC) present in the class 2 promoters is altered
either to CCCTGCCC or CCCCACCC (NB. both represent mutations in a CpG
dinucleotide compatible with the mechanism of methylation-mediated
deamination) in the class 1 promoters (Chamberlin et al., 1994). The SP1 binding sites of
the class 1 promoters bind SP1 with lower affinity and require additional activa- 
tor elements for optimal expression (Chamberlin et al., 1994). 
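
That both variant SP1 sites are the expected products of methylation-mediated deamination can be checked by enumeration. Deamination of 5-methylcytosine yields thymine, so a CpG read on one strand becomes TpG; when the same event occurs on the complementary strand it is read on the strand shown as CpG to CpA. The sketch below applies both changes to each CpG in a site; the function name is an illustrative assumption.

```python
def cpg_deamination_products(site):
    """Return the set of sequences produced by deamination at each CpG.

    For every CG dinucleotide, two products are generated: TG (deamination
    on the strand shown) and CA (deamination on the complementary strand,
    read on the strand shown).
    """
    site = site.upper()
    products = set()
    for i in range(len(site) - 1):
        if site[i:i + 2] == "CG":
            products.add(site[:i] + "TG" + site[i + 2:])
            products.add(site[:i] + "CA" + site[i + 2:])
    return products


if __name__ == "__main__":
    # Perfect SP1 element quoted in the text.
    for product in sorted(cpg_deamination_products("CCCCGCCC")):
        print(product)
```

Applied to CCCCGCCC, which contains a single CpG, the enumeration yields exactly the two class 1 variants quoted above, CCCTGCCC and CCCCACCC.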

One of the best characterized differences between paralogous gene promoters is 
that manifested by one of the human small proline-rich protein genes. The 
SPRR1A, SPRR1B, SPRR2A, and SPRR2C genes possess an Ets binding site at
position -55 (relative to the transcriptional initiation sites) which is critical for
promoter activity. The corresponding site in the SPRR3 gene has however been
lost through a C→T transition, although this loss appears to be compensated for,
at least in part, by the presence of an Ets binding site at position -239 (Fischer et
al., 1999). This single base-pair substitution has been shown to account for the
lower rate of expression of the SPRR3 gene in cultured keratinocytes as compared
with the SPRR1A and SPRR2A genes. This is despite the fact that the -239 Ets
binding site in the SPRR3 gene has a higher affinity for its cognate transcription
factor ESE-1 than the -55 sites in the SPRR1A and SPRR2A gene promoters.
Fischer et al. (1999) found that removal of the -239 Ets binding site coupled with
the restoration of the -55 binding site by a T→C mutation yielded a promoter
activity comparable to that of the SPRR1A and SPRR2A promoters and
three-fold higher than that of the wild-type SPRR3 promoter. The location of the Ets
binding site in the SPRR3 gene therefore appears to be critical for determining
the activation potential of ESE-1. In the SPRR3 promoter, three transcription
factor binding sites (AP-1, ATF, and Ets) function cooperatively but these sites are
not interdependent as in the SPRR2A promoter. Thus, the loss of the proximal 
Ets binding site in the SPRR3 gene promoter during the evolution of the gene 
family may have been responsible for changing the promoter from one which 
required highly synergistic interactions between its cognate transcription factors 
to one which functions with less stringent, cooperative interactions. 

Although the diversity of glycoproteins encoded by the HLA-B locus is greater 
than that encoded by the HLA-A locus, the HLA-B locus promoter sequences are 
more homogeneous than those of the HLA-A locus (Vallejo and Pease, 1995). We 
may speculate that selection for diversity of B glycoproteins may be directly 
related to the conservation of the promoter sequences and vice versa for the A gly- 
coproteins and their gene promoters. Although the human HLA-A (6p21.3) and 
HLA-B genes are coordinately expressed, they differ in their responsiveness to 
interferon. A functional interferon-responsive element (IRE) is present in the 
promoters of the HLA-B genes in the great apes but this IRE has been inactivated 
by an A→T transversion in the promoters of the HLA-A genes (Vallejo et al., 
1995). Since this lesion is present in the HLA-A genes of orangutan, gorilla, chim- 
panzee and human, it is likely to have occurred before the divergence of the great 
apes. 
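
The substitutions described in this section are labeled as transitions (purine to purine, or pyrimidine to pyrimidine) or transversions (purine to pyrimidine, or vice versa). A minimal sketch of that standard classification rule follows; the function name is hypothetical and the two example substitutions are those mentioned above for the SPRR3 and HLA-A promoters.

```python
# Sketch: classify a single-base substitution as a transition or a transversion.
# Names are illustrative; the rule itself is the standard textbook definition.

PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def classify_substitution(ref, alt):
    """Return 'transition' or 'transversion' for a single-base change ref -> alt."""
    ref, alt = ref.upper(), alt.upper()
    bases = PURINES | PYRIMIDINES
    if ref not in bases or alt not in bases:
        raise ValueError("expected single DNA bases A, C, G or T")
    if ref == alt:
        raise ValueError("reference and variant base are identical")
    # A transition keeps the chemical class of the base; a transversion changes it.
    same_class = {ref, alt} <= PURINES or {ref, alt} <= PYRIMIDINES
    return "transition" if same_class else "transversion"


# Substitutions mentioned in the text: C->T (SPRR3) and A->T (HLA-A IRE)
for ref, alt in [("C", "T"), ("A", "T")]:
    print(f"{ref}->{alt}: {classify_substitution(ref, alt)}")
```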

The chorionic gonadotropin β gene (CGB; 19q13.32) has evolved by dupli- 
cation of an ancestral luteinizing hormone β gene and is expressed exclusively 
in the placenta. It exhibits some 94% homology with the pituitary-expressed 
luteinizing hormone β (LHB; 19q13.32) gene between coding regions and 90% 
homology between promoters (Hollenberg et al., 1994). Both LHB and CGB 
genes possess TATAAA boxes at identical locations. This motif directs the ini- 
tiation of an LHB transcript with a 9 nucleotide 5’ UTR. The CGB gene does 
not however use this box for transcriptional initiation but instead employs a 
TATA-less promoter (with putative initiator element) that serves to initiate 
transcription 357 bp upstream of the LHB transcriptional initiation site 
(Jameson et al., 1986). This difference in transcription site usage is evolution- 
arily conserved, being found in rat, cow and human (Jameson et al., 1986). The 
experimental exchange of motifs between the two promoters has allowed the 
identification of three distinct regions between —362 and +104 necessary for 
placental expression of the CGB gene (Hollenberg et al., 1994). It would appear 
as if the placental expression pattern of this gene has resulted from the use of 
an alternative transcriptional initiation site in conjunction with the acquisi- 
tion of multiple regulatory elements that exert their effects in combinatorial 
fashion. 

The paralogous members of the human growth hormone gene family on chromosome 
17q23 differ in terms of their promoter sequences: a T→C transition at 
position −112 of the CSH1 gene promoter serves to reduce binding of the 
pituitary-specific transcription factor Pit-1 (Tansey and Catanzaro, 1991). By contrast, 
the paralogous GH1, GH2 and CSH2 genes possess a T at this location within the 
Pit-1 binding site. Pit-1 binding is thought to represent a negative control mech- 
anism by virtue of its interference with the binding of SP1 to an adjacent site in 
the promoter. The T→C transition in the CSH1 gene may thus abrogate this 
negative control mechanism since it facilitates SP1 binding and promoter gene 
activation by SP1. 

Several other examples of promoter differences in paralogous genes are known. 
Thus the human γB (CRYGB; 2q33-q35) and γC (CRYGC; 2q33-q35) crystallin 
genes possess similarly located TATA boxes but the CRYGC gene lacks the 
CCAAT box present in the CRYGB gene (Graw et al., 1993). The expression of the 
human main-type histone H1 (H1F1; 6p21.3) genes is coordinated with DNA 
replication whereas the regulation of the replacement H1 subtype H1° (H1F0; 
22q13) gene is more complex. This difference appears to be reflected in the struc- 
tures of the respective gene promoters with a CCAAT box present in the main-type 
histone H1 gene promoters being absent in the H1° gene promoter (Doenecke et 
al., 1994). Similarly, the promoters of the human mammaglobin 1 (MGB1) and 2 
(MGB2) genes, which are closely linked to the related uteroglobin (UGB) gene on 
chromosome 11q13, are homologous for the first 132 bp but then exhibit major dif- 
ferences that are probably responsible for the different expression patterns of these 
genes (Becker et al., 1998). Finally, the highly homologous human placental 
(ALPP; 2q37) and intestinal (ALPI; 2q37) alkaline phosphatase genes possess several 
nucleotide substitutions and deletions in their 5’ flanking regions which could 
account for their differing tissue specificity (Knoll et al., 1988). 


5.1.4 Orthologous genes containing dissimilar regulatory elements 


Differences between orthologous gene promoter sequences have also been 
described which may explain differences in expression of the same gene in differ- 
ent species. In some cases, the orthologous promoters differ with respect to the 
presence or absence of specific cis-acting sequences or alternatively in terms of the 
number of such motifs. In other cases, single base-pair substitutions have been 
introduced that alter the affinity of the cis-acting sequences for their cognate tran- 
scription factors. Orthologous genes may therefore acquire different expression 
profiles in different species. Thus, the expression of the human G protein-coupled 
receptor 1 (GPR1) gene (15q21.6) which is hippocampus-specific contrasts with 
that of its rat counterpart which is not expressed in the hippocampus (Marchese 
et al., 1994). The human regulatory myosin light chain (MYL5; 4p16.3) gene is 
expressed in human adult retina and fetal muscle but not in adult skeletal muscle 
suggesting that this gene is developmentally regulated (Collins et al., 1992). By 
contrast, the orthologue of this gene is abundantly expressed in the adult skeletal 
muscle of the African green monkey, providing a good example of a gene that is 
differentially expressed between humans and another primate. The tissue-spe- 
cific expression of many genes is known to be determined by the presence of 
enhancer elements that bind tissue-specific transcription factors. The evolution 
of novel tissue specificities can therefore be investigated by comparison of the 
sequences of gene regulatory elements between different species. 

The human apolipoprotein AI (APOA1; 11q23) gene is expressed predomi- 
nantly in the liver whilst its rabbit counterpart is expressed predominantly in the 
intestine. The rabbit Apoa1 gene promoter contains two cis-acting elements, E2 
(GGAGAAGAGAGGTCA) and E3 (GAAAGTCTCTCTTCTGTT) both similar 
to enhancer sequences found in murine immunoglobulin genes, that are absent 
from the human APOA1 gene promoter (Bochkanov et al., 1990; Higuchi et al., 
1988; Pan et al., 1987). These sequences may help to explain the differing tissue 
specificity of this gene between the two species. 

The rat tissue-type plasminogen activator gene differs from that of mouse and 
human (PLAT; 8p12) in that it is induced by gonadotropins via a cAMP-dependent 
pathway. The rat Plat gene promoter possesses a cAMP-responsive element 
(CRE) but, at the corresponding location in the murine and human gene promot- 
ers, a single base-pair change is present that serves to reduce drastically the bind- 
ing affinity to CRE-binding protein (Holmberg et al., 1995). This difference is 
sufficient to account for the unresponsiveness of the human and murine gene pro- 
moters to forskolin and follicle-stimulating hormone. 

The promoters of the murine and human lactoferrin (LTF; 3p21.2-p21.3) genes 
have numerous cis-acting regulatory motifs in common. However, in the bovine 
lactoferrin gene, several of these motifs (GATA-1, Oct-1, COUP, AP-2, and gluco- 
corticoid and acute phase response elements) are absent, providing a possible 
explanation of the relatively weak expression of bovine lactoferrin in comparison 
to human and mouse (Seyfert et al., 1994). 

Both tumor necrosis factor and lipopolysaccharide stimulate the expression 
of P-selectin in murine endothelial cells but this does not occur in human 
endothelial cells. This difference is thought to be explicable in terms of structural 
differences in the promoters of the P-selectin genes between the two species (Pan 
et al., 1998). The promoter of the human P-selectin (SELP; 1q23-q25) gene 
possesses a unique NF-κB binding site which is thought to be specific for p50 or p52 
homodimers. Its murine counterpart contains two tandem NF-κB binding sites 
and a variant activating transcription factor/cAMP response element which is 
similar to that found in the E-selectin gene required for tumor necrosis factor- 
and lipopolysaccharide-inducible expression. 

The promoters of the human and murine XRCC5 DNA repair genes (2q35) dif- 
fer with respect to the number of copies of a 21 bp near-perfect palindromic ele- 
ment (Kpb); the mouse Xrcc5 gene possesses a single copy whereas the human 
XRCC5 gene possesses seven copies (Ludwig et al., 1997). Amplification of this 
cis-acting element occurred in the human lineage and this may help to account for 
the human XRCC5 promoter being at least three- to five-fold stronger than its 
murine counterpart (Ludwig et al., 1997). 

Calbindin-D9k is a mammalian-specific cytosolic calcium-binding protein 
which is encoded by a gene (CALB3; Xp) that is ubiquitously expressed in the 
intestine. In the rat, this gene is also expressed in the uterus under the control of 
estrogen whose effects are mediated by an estrogen response element (ERE) 
(Darwish et al., 1990). By contrast, uterine expression is not found in human or 
baboon. In these primates, two nucleotide substitutions are present that have 
abolished the binding affinity of the ERE to the estrogen receptor and probably 
account for the loss of uterine expression (Jeung et al., 1994; 1995). 

The chorionic gonadotropin α-chain gene is expressed in the pituitary of all 
mammals but is expressed in the placenta only in primates and horses. Since 
horses and primates are only distantly related, and species such as the cow and 
rodents which are more closely related to horses than primates do not express the 
gene in placenta, it would appear as if the placental expression of the gene has been 
independently derived, perhaps by the convergent evolution (Chapter 4, section 4.3) 
of placenta-specific enhancer elements. Support for this postulate has come from a 
comparison of human and equine chorionic gonadotropin α-chain (CGA; 6q21.1-q23) 
gene promoter sequences (Steger et al., 1991). The human gene contains a 
trophoblast-specific element (TSE) and two copies of a cAMP response element 
(CRE) all of which are required for full placental expression. Both the CRE and 
TSE are bound by the leucine zipper-containing protein, CREB. By contrast, 
cAMP regulation of the equine Cga gene is not mediated by CREB but instead by 
αACT, a GATA-related protein that binds to the promoter at a site distinct from the 
CREB-binding site. It would thus appear as if the conversion of the CGA gene 
from being pituitary-specific to being also placentally expressed was a consequence 
of independent evolutionary processes in primates and horses. 

In the New World monkey, Cebus apella, the γ2-globin (HBG2) gene is 
expressed at a 20-fold higher level than the closely linked γ1-globin (HBG1) gene 
(Johnson et al., 1996). The most obvious difference between the promoters of the 
two genes, and the one most likely to account for the difference in expression, is a 
CCAAC motif in place of the canonical CCAAT in the HBG1 promoter 
(Chiu et al., 1996). Intriguingly, the HBG1 gene has been inactivated by deletion 
in the Atelidae whereas in the Pitheciini, the CCAAT box has been changed to 
CCGAT (Chiu et al., 1996). In the New World monkey Aotus azarae, a single 
hybrid HBG gene has been created as a result of an unequal crossing over event 
(Chiu et al., 1996); this gene possesses the promoter and 5’ UTR of the HBG1 gene 
coupled to the coding region of the HBG2 gene. Why the HBG1 gene has been 
inactivated several times independently in platyrrhines is unclear; expression 
from both HBG1 and HBG2 genes has been preserved in catarrhines (Chapter 4, 
section 4.2.1, Globin genes). 
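
The two CCAAT box variants mentioned above each differ from the canonical motif at a single position. A small sketch makes the positions explicit; the helper name is hypothetical and the comparison is a plain position-by-position check of equal-length motifs, not a method taken from the cited studies.

```python
# Sketch: list the positions at which a variant motif differs from the canonical one.
# The function name is illustrative only.

def motif_differences(canonical, variant):
    """Return (1-based position, canonical base, variant base) for each mismatch."""
    if len(canonical) != len(variant):
        raise ValueError("motifs must have the same length")
    return [
        (i + 1, c, v)
        for i, (c, v) in enumerate(zip(canonical.upper(), variant.upper()))
        if c != v
    ]


# Variants of the canonical CCAAT box named in the text
for variant in ("CCAAC", "CCGAT"):
    print(variant, motif_differences("CCAAT", variant))
```

Under this comparison, CCAAC differs from CCAAT only at position 5 (T→C) and CCGAT only at position 3 (A→G).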

The human prolactin (PRL; 6p22) gene possesses two promoters, the proximal 
one responsible for directing expression in the anterior pituitary, the distal one for 
expression in the decidualized endometrium and the mammary gland. Sequences 
homologous to both promoters are present in the rat Prl gene promoter but the 
distal sequence is nonfunctional (Shaw-Bruha et al., 1998). Whether the human 
PRL gene has gained a functional promoter in the last 100 Myrs since the diver- 
gence of our common ancestors, or whether the distal promoter in the rat gene has 
ceased to function during this time, is unclear. 

Other examples of inter-specific differences in orthologous promoter regions 
include the thymidine kinase 1 (TK1; 17q25.2-q25.3) gene promoter which con- 
tains functional CCAAT elements in humans, chickens and Chinese hamsters but 
not in mice and rats (Arcot et al., 1991) and the osteonectin (SPARC; 5q31-q32) 
gene promoter whose GGA-box sequences contribute to cell-type specific expres- 
sion in human but not in the bovine (Hafner et al., 1995). The mammalian 
SPARC gene also differs from its Xenopus counterpart in that the latter contains a 
TATA box but lacks a GGA-box (Damjanovski et al., 1998). 

Various species-specific sequence differences have also been reported in nuclear 
factor binding sites in the promoter regions of the D7Rp2e genes in Mus domesti- 
cus and M. pahari that alter both the pattern of binding site occupancy and the 
ability of the bound factors to repress transcription (Singh and Berger, 1998; 
Singh et al., 1998). 

Various examples of the recruitment of metabolic enzymes to a novel 
lens-specific role have been documented among the taxon-specific crystallins (see Chapter 
4, section 4.2.1, Crystallin genes). Recruitment of these genes to a new lens function 
has been achieved by acquiring the potential for lens expression by the develop- 
ment of novel promoter elements. Two strategies appear to have been employed 
viz. the modification of (i) distinct regions of the same non-functional intronic 
sequence to perform a role in lens-specific expression and (ii) pre-existing pro- 
moters previously utilized for nonlens tissue expression (see Chapter 4, section 
4.2.1, Crystallin genes). 

Not surprisingly, some differences between orthologous promoters are appar- 
ently neutral and do not obviously affect promoter function. One example of this 
is the 54 bp insert in the promoter of the human liver arginase (ARG1; 6q22.3-q23.1) 
gene which is absent from the same gene in macaques (Goodman et al., 
1994). 

Not all regulatory sequences occur upstream of the transcriptional initiation 
site. Some occur in introns (Chapter 3, section 3.1) but such regulatory elements 
are not necessarily conserved between orthologous genes. Thus, the 83 bp intron 
1 of the human CD68 (17p13) gene contains a macrophage-specific enhancer but 
the equivalent intron of the orthologous murine (macrosialin; Ms) gene does not 
despite ~80% nucleotide sequence homology (Greaves et al., 1998). 


5.1.5 Bidirectional promoters 


There is now a growing list of divergently transcribed gene pairs arranged in 
head-to-head fashion separated by a bidirectional promoter. Such an organization 
has often but not always arisen through a process of gene duplication followed by 
inversion. In true cases of a bidirectional promoter, the gene promoters overlap 
and contain common elements that can allow the coordinate regulation of expres- 
sion of both genes. Bidirectional promoters are often associated with CpG islands 
which therefore serve as useful markers for their location (Brenner et al., 1997; 
Lavia et al., 1987). 

Examples of bidirectional promoters are thought to include the human histone 
H2A and H2B (1q21-q23) genes (Hentschel and Birnstiel, 1981), type IV collagen 
(COL4A1 and COL4A2, 13q34, Figure 5.2; COL4A3 and COL4A4, 2q36-q37; 
COL4A5 and COL4A6, Xq22) genes (Schmidt et al., 1993; Oohashi et al., 1995), 
and the TAP1 and LMP2 (6p21.3) genes (Wright et al., 1995). These gene pairs 
encode proteins that are involved in the same biological processes whether as 
components of multi-chain proteins (collagens) or protein complexes (histones) 
or as proteins with associated functions (both TAP1 and LMP2 have a role in 
antigen processing). This is no coincidence since evolutionarily related pairs of 
genes that have arisen by a process of gene duplication/inversion would be pre- 
dicted to have similar functions. As a consequence of their mode of creation, they 
will often possess the potential for coordinate regulation as long as the promoter 
region(s) are still intact and the genes remain closely linked. In principle, the 
newly created functional redundancy of the promoter elements (Section 5.1.11) 
can be reduced by element removal and subsequent element sharing while still 
retaining the potential for coordinate regulation (Section 5.1.14). Alternatively, 
retention of the functional redundancy could allow diversification of promoter 
elements leading to independent regulation. 

There are examples of divergently transcribed gene pairs that are not obviously 
related evolutionarily and which are unlikely to have arisen by gene duplication 
and inversion. Thus the human phosphoribosylaminoimidazole carboxylase 
(PAICS) and phosphoribosylpyrophosphate amidotransferase (PPAT) genes, 
which encode enzymes of the de novo purine biosynthesis pathway, are only 229 bp 
apart on chromosome 4q12 but encode products that are not structurally related 
(Brayton et al., 1994). Similarly, the human isocitrate dehydrogenase 3 (IDH3G) 
and signal sequence receptor δ (SSR4) genes on chromosome Xq28 are separated 
by a 133 bp sequence of CpG island character, share overlapping promoter ele- 
ments but encode proteins which have no obvious common function or evolu- 
tionary history (Brenner et al., 1997). Finally, some genes may be transcribed in 
opposite directions but do not necessarily share promoter elements nor encode 
proteins which share any obvious function, for example the human minichromo- 
some maintenance 4 (MCM4) and DNA-activated protein kinase (PRKDC) genes 
which are separated by ~700 bp on chromosome 8q11 (Connelly et al., 1998). 


5.1.6 5’ and 3’ untranslated regions of genes 


The 5’ and 3’ untranslated regions (UTRs) of vertebrate genes are often evolu- 
tionarily conserved (Duret et al., 1993; Lipman 1997) but to a lesser extent than 
their corresponding coding sequences. Thus, whereas the average degree of 
nucleotide sequence identity shared between the coding regions of human and 
murine genes is ~85%, the 5’ and 3’ UTRs exhibit sequence identities of 67% and 
69% respectively (Makalowski et al., 1996). In practical terms, this degree of 
sequence divergence is sufficient to allow the discrimination of different mam- 
mals simply by reference to their UTR sequences (e.g. Soteriou et al., 1995). 

Within the 5’ and 3’ UTRs, highly conserved regions (HCRs) can be found. These 
were defined by Duret et al. (1993) as sequences of at least 100 bp that exhibit >70% 
homology between species that diverged more than 300 Myrs ago. Since 
such sequences would be expected, in the absence of selective pressure, to share only 
~30% similarity, evolutionary conservation implies function. Comparison of mam- 
malian and avian genes reveals that ~17% and ~30% of genes contain HCRs in 
their 5’ UTRs and 3’ UTRs respectively (Duret et al., 1993). When mammalian and 
fish genes are compared, the proportion of genes whose 3’ UTRs possess an HCR 
falls to 5% (Duret et al., 1993). Since HCRs occur relatively infrequently within 
introns, the evolutionary constraints would appear to operate at the level of the 
mature mRNA (Duret et al., 1993). The HCRs in the 5’ UTRs of the creatine kinase 
B, c-jun and actin genes span the CCAAT and TATA boxes and are therefore likely 
to play a role in transcriptional regulation (Duret et al., 1993). Other HCRs in the 5’ 
UTRs of the transforming growth factor β3 and ferritin heavy chain genes span 
elements known to be involved in translation (Duret et al., 1993). Some HCRs in 3’ 
UTRs are thought to play a role in mRNA degradation (e.g. c-fos and transferrin 
receptor genes) whereas others may be important for mRNA transport and transla- 
tion (Duret et al., 1993). HCRs in 3’ UTRs appear to be preferentially associated 
with widely expressed genes especially those encoding DNA-binding proteins and 
cytoskeletal proteins (Duret et al., 1993). 
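
The HCR criterion quoted above (at least 100 bp with >70% identity) lends itself to a simple sliding-window scan. The sketch below illustrates that criterion only; it is not the procedure used by Duret et al. (1993). It assumes the two sequences have already been aligned to equal length with '-' for gaps, counts gap columns as mismatches, and merges overlapping qualifying windows into one region. The function name and parameters are hypothetical.

```python
# Sketch: flag candidate highly conserved regions (HCRs) in a pairwise alignment.
# Assumptions (not from the source): pre-aligned input, gaps written as '-',
# gap columns scored as mismatches, fixed-size sliding window.

def find_conserved_regions(aln_a, aln_b, window=100, min_identity=0.70):
    """Return merged (start, end) half-open, 0-based column ranges covered by
    windows of `window` columns whose identity exceeds `min_identity`."""
    if len(aln_a) != len(aln_b):
        raise ValueError("aligned sequences must have equal length")
    n = len(aln_a)
    if n < window:
        return []
    # 1 where the two sequences carry the same base, 0 otherwise (including gaps)
    match = [int(a == b and a != "-") for a, b in zip(aln_a.upper(), aln_b.upper())]
    regions = []
    count = sum(match[:window])
    for start in range(n - window + 1):
        if start > 0:
            # slide the window one column to the right
            count += match[start + window - 1] - match[start - 1]
        if count / window > min_identity:
            end = start + window
            if regions and start <= regions[-1][1]:
                regions[-1] = (regions[-1][0], end)  # extend the previous region
            else:
                regions.append((start, end))
    return regions


# Toy alignment: 100 identical columns followed by 100 mismatched columns
seq_a = "A" * 200
seq_b = "A" * 100 + "C" * 100
print(find_conserved_regions(seq_a, seq_b))
```

A merged region is the union of qualifying windows, so its overall identity can fall slightly below the threshold at the edges; a stricter analysis would re-score each merged region as a whole.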

The human surfeit 2 and 4 (SURF2, SURF4; 9q34.1) genes differ from their 
murine counterparts in that whilst the mouse genes overlap by 133 bp in their 3’ 
UTRs, the human genes are separated by 302 bp (Duhig et al., 1998). The human 
SURF2 gene contains two alternative polyadenylation sites resulting in short 3’ 
UTRs of 17 and 25 bp whereas the mouse Surf2 gene contains a 359 bp 3’ UTR. 
The much shorter human SURF2 3' UTR probably accounts for the absence of 
overlap with the human SURF4 gene. 

One example of the emergence of a functional difference between the 3’ UTRs 
of two paralogous human genes is provided by the α-globin (HBA1; 16p13.3) and 
ζ-globin (HBZ; 16p13.3) genes (Russell et al., 1998). The HBA1 and HBZ genes 
are coexpressed in the embryonic yolk sac. A switch to exclusive expression of the 
HBA1 gene in the fetus and adult involves the developmental silencing of the 
HBZ gene. Silencing is achieved both by transcriptional control and through 
a post-transcriptional mechanism that serves to reduce the relative stability of 
HBZ mRNA. The HBA1 and HBZ genes both assemble an mRNP stability- 
determining complex on their 3’ UTRs but these complexes form with different 
affinities on the two genes. The diminished efficiency of complex assembly on the 
HBZ 3’ UTR results from a C→G transversion in a polypyrimidine tract that is 
common to both genes. This substitution is associated with a shortened poly(A) 
tail on the HBZ mRNA that may mediate accelerated HBZ mRNA decay. 

Another example of a functional difference between the 3’ UTRs of evolution- 
arily related human genes is provided by the human alcohol dehydrogenase 
(ADH2; 4q22) gene which differs from the other paralogous ADH family members 
by virtue of a T→C transition within a canonical polyadenylation site 3’ to 
the gene. This substitution appears to be at least in part responsible for the use of 
alternative polyadenylation sites leading to the formation of multiple ADH2 
mRNAs (Trezise et al., 1989). 

Alterations in the 5’ UTR may also influence the expression pathway. For 
example, the efficiency of mRNA translation of the renal ornithine decarboxylase 
(Odc) gene is significantly lower in Mus pahari as compared to Mus domesticus 
(Johannes and Berger, 1992). This is thought to be due to the acquisition of sev- 
eral single nucleotide substitutions and a 12 bp deletion in the 5’ UTR of the Odc 
gene in M. pahari. These sequence changes are predicted to alter the secondary 
structure of the mRNA molecule and this may influence translation efficiency. 


5.1.7 Inter-specific differences in promoter selection 


Three distinct mRNA species (L1, M and L2 respectively) are generated from the 
human aldolase A (ALDOA; 16q22-q24) gene via the differential incorporation of 
three exons encoding the 5’ UTR (Mukai et al., 1991; Figure 5.3). The production 
of the three mRNAs is controlled by three different promoters (Figure 5.3) which 
are utilized singly, doubly or all together depending upon the expressing tissue. 
The DNA sequences corresponding to these promoters are present in the rat gene 
but the L1 promoter is not utilized (Mukai et al., 1991). Thus the L1 promoter has 
either been acquired in the human lineage or (perhaps more likely) lost in the rat 
lineage since the divergence of primates and rodents. 

Figure 5.3. Exon/intron distribution in the human aldolase A (ALDOA) gene. Exons 1, 
3, and 4 correspond to leader exons L1, M2, and L2. The positions of the three 
alternative promoters are denoted by p (redrawn from Mukai et al., 1991). 

The vast majority of studies that have mapped transcriptional initiation sites 
have been performed on cultured cells and it is by no means clear that those sites 
identified by in vitro studies are actually always utilized in vivo. White et al. (1998) 
identified the in vivo transcriptional initiation sites of the cystic fibrosis trans- 
membrane conductance regulator (CFTR; 7q31.3) gene in both human and 
mouse. Tissue-specific variation in the position of the transcriptional initiation 
sites was noted in both species but the sites were not conserved between equiva- 
lent tissues. This finding suggests that the precise mechanism of transcriptional 
initiation for a given gene may not be absolutely conserved between species. 


5.1.8 Developmental changes in gene expression 


In higher primates (both platyrrhine and catarrhine), the γ-globin (HBG1, 
HBG2; 11p15.5) genes are expressed during fetal life whereas in nonprimates and 
prosimians, the genes are expressed in the embryo. The conversion of the HBG1 
and HBG2 genes to a fetal pattern of expression must therefore have occurred 
after the divergence of simians from prosimians some 55 Myrs ago. Implicit in 
this conversion is the activation of the γ-globin gene in fetal life (a stage at which 
it was previously repressed) and repression of the γ-globin gene in embryonic life 
at which stage it was previously active. Prosimians are characterized by the 
possession of only one γ-globin gene whereas higher primates possess two copies. The 
promoter of one of the duplicated γ-globin genes may thus have been able to 
escape the influence of natural selection and in so doing accumulate mutations 
that served to alter the timing of its developmental expression. Any beneficial 
changes thus acquired could then have been readily transferred to the other 
γ-globin gene by gene conversion. 

A burst of sequence change occurred after the divergence of simians from 
prosimians but before the divergence of Old World monkeys from New World 
monkeys (Fitch et al., 1990, 1991). Most of these changes were then conserved 
during the subsequent evolution of the simian γ-globin genes. Studies of the 
γ-globin gene promoter in transgenic mice have shown that all the sequence 
changes necessary for changing to the fetal pattern of gene expression are located 
within a 4 kb fragment containing the γ-globin gene (TomHon et al., 1997). 
However, within 260 bp of the transcriptional initiation site, there are 19 
nucleotide changes specific to simians, 16 of which are located in or near highly 
conserved sequence motifs (Chiu et al., 1997; Fitch et al., 1990; 1991). Further, 
comparison of the human and galago (a prosimian) sequences reveals 57 
nucleotide differences over the same region with the majority again being located 
near highly conserved sequence motifs. It may be some time before the precise 
sequence changes responsible for the change to fetal expression are unequivocally 
determined. 

One way of approaching this question has been through the use of ‘differential 
phylogenetic footprinting’ (Section 5.1.1) and gel retardation analysis. By these 
means, Gumucio et al. (1994) identified several proteins (G1, G2, G3, and G4) that 
bound the galago sequence but did not bind the corresponding human sequence. 
Phylogenetic reconstruction and gel retardation analysis were used to demonstrate 
that the promoter sequence of the embryonically expressed γ-globin gene of 
the primate common ancestor would have bound proteins G1 and G2 some 4- to 6-fold 
more strongly than the promoter of the fetally expressed simian ancestor 
(Gumucio et al., 1994). The binding strength of these proteins correlated with 
repression of promoter activity in vitro, suggesting that the loss of the binding sites 
for these proteins in the ancestral simian γ-globin gene could have potentiated the 
conversion of this gene to a fetal onset of expression (Gumucio et al., 1994). 


5.1.9 Promoter polymorphisms 


Promoter polymorphisms affecting the expression of the downstream gene are 
probably not infrequent. However, as yet, relatively few examples of promoter 
polymorphisms in human genes have been properly characterized by means of 
functional (e.g. reporter gene) studies. Examples of such polymorphisms include 
plasminogen activator inhibitor type 1 (PAI1; 7q21.3-q22.1; Dawson et al., 1993), 
tumor necrosis factor α (TNF; 6p21.3; Wilson et al., 1997), apolipoprotein AI 
(APOA1; 11q23.3; Angotti et al., 1994), lipoprotein(a) (LPA; 6q27; Suzuki et al., 
1997; Wade et al., 1993), lipoprotein lipase (LPL; 8p22; Hall et al., 1997), inter- 
leukin 6 (IL6, 7p21-p15; Fishman et al., 1998), factor VII (F7; 13q34; Pollak et al., 
1996), hormone-sensitive lipase (LIPE; 19q13; Talmud et al., 1998) and 
monoamine oxidase A (MAOA, Xp11.23; Sabol et al., 1998). The presence of poly- 
morphisms in gene promoter regions is not unusual per se since all gene regions 
harbor polymorphisms. Indeed, such variants are quite consistent with, and 
explicable in terms of, a neutralist model. However, it is possible that those poly- 
morphisms in the promoter region which specifically affect gene expression con- 
fer, or have conferred, a selective advantage. 

The PAI1 promoter polymorphism constitutes the insertion or deletion of a 
single G residue at position −675 (Dawson et al., 1993). The ins allele contains an 
interleukin 1-responsive element which is not present in the del allele suggesting 
that individuals homozygous for the del allele may exhibit an altered PAI-1 
response during the acute phase reaction (Dawson et al., 1993). Similarly, a common 
insertion polymorphism (G) at position −1607 in the human matrix 
metalloproteinase-1 (MMP1; 11q22-q23) gene promoter creates a binding site for 
members of the Ets family of transcription factors which results in the increased 
transcription of the gene (Rutter et al., 1998). 

A highly unusual 76 bp length polymorphism (f = ~0.20/0.80) in the human 
antithrombin III (AT3; 1p31.3-qter) gene promoter results from the alternative 
presence of two apparently distinct sequences of 108 bp (L) and 32 bp (S) at the 
same position, ~345 bp upstream of the ATG translational initiation codon (Bock 
and Levitan, 1983). How did this unusual polymorphism evolve and become 
established in the general population? Winter et al. (1995) noted that the sequence 
flanking the L allele contained numerous homologous motifs suggestive of an 
ancient duplication event. Some residual homology was also noted between the L 
and S alleles (Figure 5.4) but the L allele did not obviously contain duplicated 
sequence. The most parsimonious explanation was therefore the emergence of the 
S allele from the L allele by partial deletion followed by sequence divergence. The 
deletion event(s) could have occurred by homologous recombination mediated by 
the homologies between the L-specific sequence and the region immediately 
downstream. Interestingly, however, no difference in expression could be dis- 
cerned between the two alternative alleles (Winter et al., 1995). 

A repeat length polymorphism in the human solute carrier family member 
SLC6A4 (17q11.1-q12; Delbrück et al., 1997) gene also appears to be present in 
the gorilla although not in the chimpanzee. It is unclear whether this is an exam- 
ple of an ancient trans-species polymorphism (Chapter 1, section 1.2.2) or 
whether it has arisen independently in the two lineages. 
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L CTGATTTAGTTAACGAGAAACAAAAAATCCTGCAGACAAGTTTC TCCTCAGTCAGGTA 
T A AAC CAAGTTT TCIT GT AG 
S TGGGTATGAAC CAAGTTTGTTTCCTTGGTTAG 
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Figure 5.4. Alignment of the alternative L- and S-specific sequences in the promoter 
region of the human AT3 gene indicating regions of homology (redrawn from Winter et 
al., 1995). For explanation, see text. 
Figure 5.5. Evolution of the human apolipoprotein E2/CI/CIV/CII loci (adapted from 
Allan et al., 1995). The relative locations of the human APOE, APOC1, APOC2, and 
APOC4 genes and the APOC1P1 pseudogene are shown by closed boxes. The locations 
of the two hepatic control regions (HCR) are denoted by ovals. 


5.1.10 Promoter duplication 


Promoters may occasionally be duplicated in the absence of the duplication of 
their associated genes. Thus the BRCA1 (17q21) and 1A1.3b (M17S2; 17q21.1) 
gene promoters are duplicated in tandem fashion (Barker et al., 1996) but the 
functional significance of this observation is unknown. 

Two copies of a motif known as the hepatic control region (HCR), located, 
respectively, 15 kb and 5 kb downstream of the human APOE and APOC1 genes 
on chromosome 19q13.2, arose by a regional duplication event which 
encompassed the APOC1 gene ~40 Myrs ago (Raisonnier et al., 1991; Figure 5.5). 
Whilst the duplicated APOC1 gene has become a pseudogene (APOC1P1), the 
duplicated HCR (HCR2) has retained 85% homology with HCR1 and appears to be 
able in its new location to direct the transcription (albeit infrequent 
transcription) of a sequence resembling an exon which is spliced to APOC4 5’ 
exons even in the absence of conventional promoter elements (Allan et al., 1995). 


5.1.11 Functional redundancy of promoter elements 


The human 7SK RNA gene (RN7SK; chromosome 6) contains a proximal 
sequence element (PSE) between -49 and -65 and a distal sequence element 
(DSE) between -243 and -210 which display sequence similarity and functional 
homology (Boyd et al., 1995). The PSE can retain function after extensive muta- 
tion but only if the DSE is intact. How this apparent functional redundancy has 
been maintained is unclear. 

The promoter of the human ε-globin (HBE1; 11p15.5) gene displays functional 
redundancy of a different type; it contains eight YY1 binding sites, five binding 
sites for a putative stage selector protein and seven binding sites for a hitherto 
unidentified protein (Gumucio et al., 1993). Other probable examples of functional 
redundancy include the eight SP1 sites present in the human muscle phospho- 
fructokinase (PFKM; 12q13.3) gene promoter (Johnson and McLachlan, 1994). 
Without detailed functional studies, however, it is unclear if the multiple cis-acting 
regulatory elements are truly redundant or if each additional copy leads to an 
incremental increase in transcription factor binding potential and hence promoter 
strength. 


5.1.12 Recruitment of repetitive sequences as promoter and silencer 
elements 


Alu sequences. A number of examples of the recruitment of repetitive sequence 
elements, mostly Alu sequences, as gene promoter and silencer elements have 
been studied (reviewed by Britten, 1996; 1997; Robins and Samuelson, 1992; 
Table 5.1). Over evolutionary time, the insertion of Alu repetitive sequence ele- 
ments in the vicinity of genes has served to introduce different motifs that have 
altered the expression level or tissue specificity of the associated gene either 
immediately or after subsequent fine tuning by natural selection. One such motif 
is a retinoic acid response element (RARE) in the human KRT18 (12q13) gene: 
three hexamer half sites, related to the consensus AGGTCA, arranged as direct 
repeats with a spacing of 2 bp (Vansant and Reynolds, 1995). These sites are 
capable of binding the retinoic acid receptor and functioning as RAREs in 
transiently transfected cells (Vansant and Reynolds, 1995). We may thus surmise 
that the random insertion of thousands of Alu sequences in the primate genome 
could have altered the expression of numerous genes over evolutionary time. 

Table 5.1. Repetitive sequence insertions in mammalian genes that have altered the 
expression of the gene 

Gene                      Insertion               Comments                                      Reference
KRT18                     Alu                     Alu sequence contains retinoic acid receptor  Thorey et al. (1993);
                                                  binding site                                  Vansant and Reynolds (1995)
CD8A                      Alu                     Alu sequence carries two Lyf-1, one bHLH and  Hambor et al. (1993)
                                                  one GATA-3 binding sites
FCER1G                    Alu                     Alu sequence carries positive and negative    Brini et al. (1993)
                                                  elements
PTH                       Alu                     Alu sequence carries negative calcium         McHaffie and Ralston (1995)
                                                  response element
BRCA1                     Alu                     Alu sequence serves as estrogen               Norris et al. (1995)
                                                  receptor-dependent enhancer
MPO                       Alu                     Alu sequence contains composite SP1-thyroid   Piedrafita et al. (1996)
                                                  hormone-retinoic acid response element
WT1                       Alu                     Silencer in intron 3, 12 kb from promoter     Hewitt et al. (1995)
F9                        LINE                    Located at position -800                      Kurachi et al. (1997)
LPA                       LINE                    Located ~20 kb upstream of transcriptional    Yang et al. (1998)
                                                  initiation site; contains enhancer element
AMY1A, AMY1B, AMY1C       Endogenous retrovirus   Androgen response element                     Ting et al. (1992)
ZNF80                     Endogenous retrovirus   ERV9 element                                  Di Cristofano et al. (1995a, 1995b)
PLA2L                     Endogenous retrovirus   HERV-H element                                Kowalski et al. (1996, 1997)
PTN                       Endogenous retrovirus   HERV element                                  Schulte and Wellstein (1998)
HLA-DRB6                  Endogenous retrovirus   MMTV-like sequence                            Mayer et al. (1993)
Sex-limited C4 protein,   Endogenous retrovirus   ERV element contains androgen-responsive      Stavenhagen and Robins (1988);
murine                                            sites                                         Adler et al. (1992, 1993); Robins et al. (1994)

Sequences are human unless otherwise specified. 

An Alu sequence in the last intron of the human CD8A (2p12) gene operates so 
as to modulate the activity of an adjacent T lymphocyte-specific enhancer 
(Hambor et al., 1993). The Alu sequence appears to contain four functional tran- 
scription factor binding sites [Lyf-1 (2), bHLH (1), GATA-3 (1)]. Hambor et al. 
(1993) noted seven (non-CpG) nucleotide differences by comparing this Alu 
sequence with its probable source gene. Two of these differences were in the 
GATA-3 binding site and both were shown by site-directed mutagenesis to be 
necessary for its function (Hambor et al., 1993). This was therefore proposed to be 
a possible example of positively selected change in an inserted Alu sequence. 
However, since the Alu sequence appears to be capable of modulating the activity 
of the enhancer through the formation of a cruciform (stem-loop) structure with 
another downstream Alu sequence (Hanke et al., 1995), it is unclear what if any 
role the putative binding sites might have in modulating enhancer activity. 

An Alu sequence in the promoter region of the gene (FCER1G; 1q23) encoding 
the gamma chain of the high affinity IgE receptor contains positive and negative 
cis-acting elements which contribute to the hematopoietic cell specificity of 
expression of this gene (Brini et al., 1993). An Alu sequence in the myeloperoxi- 
dase (MPO; 17q23.1) gene promoter contains four repeats related to the consen- 
sus recognition sequence for nuclear hormone receptors (AGGTCA) (Piedrafita et 
al., 1996). This sequence acts as a composite SP1-thyroid hormone-retinoic acid 
response element interacting with SP1 as well as the retinoic acid and thyroid 
hormone receptors (Piedrafita et al., 1996). An Alu sequence may also have been 
recruited to perform a regulatory function in the human θ1-globin gene (HBQ1; 
16p13.3; Kim et al., 1989). Finally, the estrogen responsiveness of the human 
breast cancer (BRCA1; 17q21) gene appears to have been conferred by an Alu 
repeat located within the promoter region of the gene (Norris et al., 1995). Alu 
sequences have thus introduced a variety of different DNA sequence motifs capa- 
ble of binding a range of trans-acting factors that have altered the expression level 
or tissue specificity of the associated genes. 
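
The half-site arrangements described above (hexamers related to the consensus 
AGGTCA, arranged as direct repeats with a fixed spacing) can be illustrated with a 
simple sequence scan. The Python sketch below is illustrative only: the function 
names, the one-mismatch tolerance and the toy sequence are assumptions introduced 
here rather than details of the cited studies, and only the forward strand is 
searched. 

```python
CONSENSUS = "AGGTCA"  # nuclear hormone receptor half-site consensus quoted in the text


def mismatches(a: str, b: str) -> int:
    """Count positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def find_half_sites(seq: str, max_mismatch: int = 1) -> list:
    """Return 0-based starts of hexamers within max_mismatch of the consensus."""
    seq = seq.upper()
    k = len(CONSENSUS)
    return [i for i in range(len(seq) - k + 1)
            if mismatches(seq[i:i + k], CONSENSUS) <= max_mismatch]


def direct_repeats(seq: str, spacing: int = 2, max_mismatch: int = 1) -> list:
    """Return pairs of half-site starts separated by exactly `spacing` bp."""
    starts = set(find_half_sites(seq, max_mismatch))
    step = len(CONSENSUS) + spacing
    return [(i, i + step) for i in sorted(starts) if i + step in starts]


# Hypothetical toy sequence (not a real Alu): three half-sites, 2 bp apart,
# the third differing from the consensus at one position (GGGTCA).
toy = "TTAGGTCACGAGGTCATTGGGTCAAA"
print(find_half_sites(toy))
print(direct_repeats(toy, spacing=2))
```

A tolerance of one mismatch is an arbitrary choice; a position weight matrix 
would be the more usual tool for real promoter analysis. 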

Not all Alu sequences inserted into gene regions function as promoters or 
enhancers of transcription. Indeed, an Alu sequence in the third intron of the 
Wilms’ tumor (WT1; 11p13) gene, 12 kb downstream of the promoter, acts as a 
transcriptional silencer, repressing transcription of the WT1 gene in cells of 
non-renal origin (Hewitt et al., 1995). Since this silencer can function in an 
orientation- and distance-independent fashion, Hewitt et al. (1995) suggested that it may 
have acquired silencer function rather than having simply possessed a silencer 
function intrinsic to the Alu sequence. Another example of a silencer is the RRE 
repetitive sequence element 1 kb upstream of the murine erythropoietin receptor 
gene (Youssoufian and Lodish, 1993). This sequence, one of ~10° copies in the 
mouse genome, may exert its cis-mediated repressor effect on the EpoR gene by 
read-through transcription. Finally, a 27 bp sequence that is important in the 
negative regulation of a murine immunoglobulin κ light chain gene appears to have 
been derived from a B1 repetitive element (Saksela and Baltimore, 1993). 

A motif within an Alu sequence, 3.6 kb 5’ to the transcriptional initiation site 
of the human parathyroid hormone (PTH; 11p15.1-p15.3) gene, contributes 
toward a negative calcium response element (McHaffie and Ralston, 1995; 
Okazaki et al., 1991). This element possesses a 12 bp palindromic core 
(TGAGACAGGGTCTCA) and since it is common in the human genome courtesy of 
the wide distribution of Alu sequences, it may have provided the means for the 
expression of other genes to be down-regulated by extracellular calcium. 
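
The palindromic character of this core can be checked directly. The sketch below 
assumes that the quoted sequence is the uninterrupted 15-nucleotide string 
TGAGACAGGGTCTCA and that the '12 bp palindromic core' refers to two 6 bp arms 
flanking a 3 bp spacer; this 6 + 3 + 6 partition is an interpretation made here, 
not one stated in the text. 

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    return seq.upper().translate(COMPLEMENT)[::-1]


core = "TGAGACAGGGTCTCA"
# Assumed partition: two 6 bp arms separated by a 3 bp spacer.
left_arm, spacer, right_arm = core[:6], core[6:9], core[9:]

# The arms form an inverted repeat if one is the reverse complement of the other.
arms_are_palindromic = reverse_complement(left_arm) == right_arm
print(left_arm, spacer, right_arm, arms_are_palindromic)
print(len(left_arm) + len(right_arm), "bp in the two arms")
```

Under this reading the two arms account for the 12 palindromic base pairs, while 
the central spacer is not self-complementary. 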

Wu et al. (1990) have proposed that an Alu repeat 2.2 kb upstream of the human 
ε-globin (HBE1; 11p15.5) gene abolishes down-regulation of the gene mediated 
by a silencer element 4.5 kb upstream. These authors suggested that the down- 
regulation was caused by transcriptional interference and that the Alu repeat 
somehow blocks transcription from the upstream element specifically in embry- 
onic erythroid cells where it is transcriptionally active. 

Upon insertion, Alu sequences may also provide alternative sites for transcrip- 
tional initiation, as found in the human apolipoprotein B mRNA-editing enzyme 
(APOBEC1) gene (12p13.1; Fujino et al., 1998). The human APOBEC1 gene is 
expressed exclusively in the small intestine whereas in rodents, the gene is more 
widely expressed. Whilst the human gene contains two Alu sequences and two 
major transcriptional initiation sites, its murine counterpart lacks Alu repeats and 
contains a single transcriptional initiation site (Figure 5.6). Insertion of the Alu 
sequences must therefore have occurred during the last 100 Myrs since the diver- 
gence of the primate and rodent lineages. In the human gene, the first Alu repeat 
contains the first of the transcriptional initiation sites and lies upstream of a 
region exhibiting strong homology to the murine intestinal promoter. The second 
Alu repeat contains the second transcriptional initiation site and is located in the 
first intron. Comparison with the murine gene suggests that Alu sequence inser- 
tion may have split the human intestinal promoter leading to utilization of the 
downstream Alu sequence as an alternative site of transcriptional initiation. In 
this case, it would appear as if promoter function has simply adapted to the pres- 
ence of the Alu repeat rather than being qualitatively altered by it. 


Figure 5.6. Comparison of human and mouse APOBEC1 gene promoters (from Fujino et 
al., 1998). Transcriptional initiation sites are denoted by vertical arrows. Alu sequences 
are denoted by horizontal arrows. The murine exon 4 is homologous to a portion of 
human exon 2. The region of transcriptional initiation between nucleotides -848 and 
-1034 of the human gene is ~70% homologous to the region of the murine gene promoter 
responsible for intestinal expression. 


Endogenous retroviral elements. Not all inserted sequences implicated in 
influencing gene expression are Alu repeats. One example of the recruitment of 
non-Alu repetitive sequences within a mammalian promoter is provided by the 
human amylase genes. These genes are located in a 230 kb region of chromosome 
1p21 (Figure 5.7); two genes (AMY2A and AMY2B) are expressed in the pancreas, 
three (AMY1A, AMY1B, AMY1C) are expressed in the salivary gland and one 
(AMYP1) represents a truncated pseudogene. A complete γ-actin processed 
pseudogene is located immediately upstream of AMY2B, the consequence of an 
insertion event ~40 Myrs ago, after the divergence of the New World monkeys 
but before the divergence of the Old World monkeys from the human-ape lineage 
(Emi et al., 1988; Samuelson et al., 1988, 1990, 1996). Whether this pseudogene 
plays a role in regulating amylase gene expression is unclear but the observation 
that New World monkeys do not possess the pseudogene or express salivary amy- 
lase suggests that the pseudogene insertion could have been involved in the 
switch in tissue specificity of amylase gene expression. In the other four amylase 
genes, the actin pseudogene is interrupted by an endogenous retroviral sequence, 
the result of a second insertion which occurred in the human-ape lineage after the 
divergence of Old World monkeys (Samuelson et al., 1996) and prior to the more 
recent gene duplications (Figure 5.7). The results of gene expression studies in 
transgenic mice were initially consistent with the view that the retroviral 
sequence was essential for salivary gland-specific expression of the AMY1C gene 
(Ting et al., 1992). The AMY2A gene, which contains only a residual long termi- 
nal repeat (LTR) left by excision of the retroviral sequence, is expressed in the 
pancreas. Excision of the retroviral sequence from AMY2A thus appeared to be 
associated with reversion to a pancreas-specific pattern of expression (Figure 5.8). 
Transcription of the pancreatic amylase genes was found to be initiated at exon a 
whereas transcription of the salivary gland amylase genes appeared to be initiated 
from an untranslated exon within the actin pseudogene (Figure 5.8). It was there- 
fore considered possible that insertion of the retroviral element activated a cryp- 
tic promoter within the actin pseudogene that specified the transcriptional 
initiation site for the expression of the salivary amylase genes. However, more 
recent studies in various Old World monkey species lacking the retroviral 
sequence have indicated that salivary amylase gene expression predated the retro- 
viral insertion which cannot therefore be regarded as essential for amylase expres- 
sion in the salivary gland (Samuelson et al., 1996). 

Figure 5.7. Distribution of the amylase genes on human chromosome 1p21. The 
positions of the five amylase genes (AMY2B, AMY2A, AMY1A, AMY1B and AMY1C) 
and the pseudogene AMYP1 are indicated by boxes. ERVA denotes the endogenous 
retroviral element present in three copies upstream of each salivary amylase gene. The 
positions of the retroviral LTRs are marked +. ACTGP3 is a γ-actin pseudogene located 
upstream of the AMY2B gene. Sequences related to this pseudogene are represented by 
solid boxes (redrawn from Meisler and Ting, 1993). 

Figure 5.8. Evolution of the human amylase genes. The insertions of the γ-actin 
pseudogene (solid bar) and the retrovirus (ERVA1) occurred around 40 Myrs ago. Exon a 
and the untranslated exon (NTE) are represented by open boxes. An arrow denotes the 
transcriptional start site (redrawn from Meisler and Ting, 1993). 

There are now a number of other examples of endogenous retroviral elements 
which have been recruited to promoter function. These elements have been aptly 
described as ‘perpetually mobile footprints of ancient infections’ (Sverdlov, 1998) 
and this mobility has sometimes been put to constructive use by evolution. For 
example, the hematopoietic cell-specific expression of the human zinc finger gene 
ZNF80 (3q13.3) is driven by the LTR of ERV9, a member of a low copy number 
family of endogenous retroviral elements (Di Cristofano et al., 1995a, 1995b). 
Since the ERV9 insertion was not found in the African green monkey, rhesus 
macaque or orangutan, the integration of this element must have occurred after 
the divergence of the orangutan from the other great apes but before the diver- 
gence of the gorilla. 

The LTR of a HERV-H-related sequence within an intron of a human 
phospholipase A2-like (PLA2L; 8q24) gene has been shown to be important for PLA2L 
gene expression (Feuchter-Murthy et al., 1993; Kowalski et al., 1996; 1997). Since 
the retroviral LTR is also present in the orthologous genes of chimpanzee and 
gorilla but not in orangutan and lower primates, we may infer that it was 
integrated into the ancestral primate genome about 15-20 Myrs ago. Upon further 
analysis, it has become clear that the teratocarcinoma cell-specific PLA2L transcript 
is actually a fusion transcript between two once distinct genes, the HERV-H 
LTR-associating 1 (HHLA1; 8q24) gene and the otoconin 90 (OC90; 8q24) gene 
(Kowalski et al., 1999). Presumably the LTR acts not only as a strong promoter 
but also as an inducer of transcriptional fusion, at least in teratocarcinoma cells 
where HERV-H LTRs are known to be active transcriptionally. 
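
The presence/absence reasoning used here to date retroviral insertions can be 
written out explicitly. The Python sketch below is a simplified illustration 
under stated assumptions: a single insertion event with no subsequent loss, and 
the conventional branching order of lineages from the human lineage. The names 
SPLIT_ORDER and bracket_insertion are hypothetical and are introduced only for 
this example. 

```python
from typing import Iterable, Optional, Tuple

# Lineages ordered from the most recent to the most ancient split from the
# human lineage (assumed standard primate branching order).
SPLIT_ORDER = ["chimpanzee", "gorilla", "orangutan",
               "Old World monkeys", "New World monkeys"]


def bracket_insertion(present: Iterable[str],
                      absent: Iterable[str]) -> Tuple[Optional[str], Optional[str]]:
    """Bracket the timing of an insertion that is found in the human genome.

    Returns (oldest_carrier, nearest_lacking): the insertion predates the split
    of oldest_carrier and postdates the split of nearest_lacking.
    """
    rank = {name: i for i, name in enumerate(SPLIT_ORDER)}
    present, absent = set(present), set(absent)
    unknown = (present | absent) - set(rank)
    if unknown:
        raise ValueError(f"unknown lineages: {sorted(unknown)}")
    oldest_carrier = max(present, key=rank.__getitem__, default=None)
    nearest_lacking = min(absent, key=rank.__getitem__, default=None)
    if (oldest_carrier is not None and nearest_lacking is not None
            and rank[oldest_carrier] >= rank[nearest_lacking]):
        raise ValueError("pattern cannot be explained by a single insertion")
    return oldest_carrier, nearest_lacking


# Distribution reported in the text for the PLA2L-associated LTR.
print(bracket_insertion(present={"chimpanzee", "gorilla"}, absent={"orangutan"}))
```

For the PLA2L LTR the function brackets the event between the orangutan and 
gorilla splits, matching the inference drawn in the text; converting that 
interval into millions of years requires divergence dates that the sketch does 
not supply. 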

A 6.3 kb endogenous retroviral element of the HERV family has been inserted 
into the human pleiotrophin (PTN; 7q33) gene between the exons specifying the 
5’ untranslated region and those encoding the protein product (Schulte and 
Wellstein, 1998). This HERV element has been shown to drive the expression of 
the PTN gene in trophoblasts and in choriocarcinoma cell lines (Schulte et al., 
1996). The HERV sequence is also present in the Ptn genes of chimpanzee and 
gorilla but not in that of the rhesus macaque indicative of a genomic insertion 
event which occurred after the divergence of the great apes from the Old World 
monkeys some 23 Myrs ago. The expression of the human HLA-DRB6 gene is 
driven by the LTR of a Mouse Mammary Tumor Virus (MMTV)-like sequence 
which substituted for the original promoter upon insertion (Mayer et al., 1993). 
Since this MMTV-LTR is also present in the macaque, the insertion event must 
have occurred >23 Myrs ago. Finally, Feuchter et al. (1992) have employed a sys- 
tematic screening strategy to demonstrate that the expression of a number of 
other human cellular genes may be influenced by endogenous retroviral ele- 
ments; these include the cell division cycle 4-like (CDC4L) gene. 

Retroviral insertion may affect gene expression even if the element is inserted 
outside the promoter region. One example is the LTR of an RTVL-H element 
which is present in the 3’ UTR of a human placentally expressed gene (termed 
‘PLT’ by the authors; Goodchild et al., 1992); the Plt mRNA undergoes 
alternative splicing at its 3’ end with polyadenylation occurring within the LTR in one 
of the transcripts. This sequence is present in the Plt genes of the great apes and 
Old World monkeys and therefore must have been inserted prior to the 
divergence of these groups. 

An LTR of an ERV9 element has been found at the 5’ boundary of the human 
β-globin (HBB; 11p15) gene locus control region (LCR; see Chapter 1, section 
1.2.8) just upstream of the DNase I-hypersensitive site HS5 (Long et al., 1998). 
This LTR is composed of 14 tandem repeats containing recurring GATA, 
CACCC, and CCAAT motifs that are potentially capable of binding 
GATA-binding factor, BKLF/TEF2 and C/EBP transcription factors respectively. The 
orthologous sequence in gorilla has only five repeats whilst the repeat number is 
polymorphic in humans. Reporter gene studies have demonstrated that this LTR 
possesses both enhancer and promoter activity in erythroid cells (Long et al., 
1998). Moreover, in erythroid cells, the LTR activates transcription of the 
downstream retroviral R and U5 regions and of genomic regions still further 
downstream (Long et al., 1998). The HS5 LTR may therefore play a role in regulating 
the transcription of the human β-globin LCR (which is preferentially transcribed 
in erythroid cells) which may in turn serve to open up the chromatin structure of 
the β-globin gene domain. 

Retroviral insertions have also influenced gene expression in other mammalian 
species. Thus the promoter of the rat oncomodulin gene (human counterpart 
OCM; 7p13-p11) contains a long terminal repeat of an intracisternal A particle 
(IAP; a family of endogenous retroviral elements) which has been recruited to 
perform a gene regulatory function (Banville and Boie, 1989). Since the mouse lacks 
the integrated IAP element, the IAP insertion must have occurred after the diver- 
gence of the two rodent species 40 Myrs ago (Banville et al., 1992). 


LINE elements. LINE elements may also have served as mobile regulatory 
sequences altering the expression of target genes. They have a tendency to acquire 
promoter sequences from non-LINE sources, with different sequence lineages 
acquiring different promoters (Adey et al., 1994a). These regulatory sequences 
may then confer a novel specificity of expression upon genes with which they 
become associated. Thus, a LINE element at position -800 in the human factor 
IX (F9; Xq27.1-q27.2) gene promoter has been implicated in conferring high 
level liver-specific expression on the F9 gene (Kurachi et al., 1997). An enhancer 
some 20 kb 5’ to the apolipoprotein(a) gene (LPA; 6q27), which contains binding 
sites for Ets and Sp1 transcription factors, also resides within a LINE element 
(Yang et al., 1998). 

The polyadenylation signal of the murine thymidylate synthase (Tyms) gene has 
been derived from an inserted LINE element (Harendza and Johnson, 1990). 
Similarly, the polyadenylation sites of several human genes including the 
β-tubulin (TUBB; 6p21.3) gene have been derived from MIR elements, mammalian 
wide interspersed repeats (Murnane and Morales, 1995). 

As we have seen (Chapter 1, section 1.4.3), only a proportion of LINE elements 
are transpositionally active, the remainder having been inactivated by truncation 
and rearrangement. Adey et al. (1994b) performed phylogenetic analysis to deduce 
the sequence of an ancestral murine transpositionally active LINE element. This 
element was then ‘resurrected’ by chemical synthesis and shown in vitro to possess 
promoter activity. 


Minisatellites and microsatellites. Several minisatellites are known to have 
become recruited as gene regulatory elements. These include minisatellites 1 kb 
3’ to the polyadenylation signal of the human HRAS proto-oncogene (11p15.5; 28 
bp repeat unit; Green and Krontiris, 1993), 600 bp 5’ of the transcriptional 
initiation site of the human insulin (INS; 11p15.5; 14 bp repeat unit; 
Catigani-Kennedy et al., 1995) gene and 4.1 kb upstream of the human insulin-like growth 
factor II (IGF2; 11p15; Paquette et al., 1998) gene, and the minisatellite in the 
DH/JH intron of the human immunoglobulin heavy chain (IGHD/IGHJ; 
14q32.33) gene cluster (Treppicchio and Krontiris, 1992). The HRAS 
minisatellite binds members of the rel/NFκB family (Treppicchio and Krontiris, 1993) 
whilst the INS minisatellite binds the transcription factor Pur-1 (Kennedy et al., 
1995). The IGHD/IGHJ minisatellite binds a mycHLH protein closely related to 
USF/MLTF (Treppicchio and Krontiris, 1992). This element may influence 
IGHD/IGHJ gene expression since sequestration of the transcription factor by 
the minisatellite inhibits transcriptional activation through a bona fide USF 
enhancer element. Since the HRAS, INS and IGHD/IGHJ minisatellites are 
absent from the analogous positions in orthologous non-primate genes, it would 
appear that evolution has recruited these elements during primate evolution. 
Such minisatellites may have provided the raw material for promoter and 
enhancer sequences which have then been optimized by selection. Alternatively, 
if transcriptional effects emanating from these sequences are comparatively 
minor, selection would probably have been unable either to improve or remove 
them and they would have remained as transcriptional control elements with 
minor effect. 

Microsatellites can also serve as regulatory elements and indeed some are 
conserved at orthologous positions in the genomes of different species (Meyer et al., 
1995; Moore et al., 1991; Stallings, 1994, 1995; Stallings et al., 1991). One example 
is the (TCAT)n repeat element in the first intron of the human tyrosine 
hydroxylase (TH; 11p15.5) gene (Meloni et al., 1998). This repeat is similar to the 
consensus thyroid response element (TRE) present in the human and rat TH genes and 
gel shift assays have provided evidence for the formation of specific complexes 
between the tetranucleotide repeat and proteins found in HeLa cell extracts. A 
(CCTTT)n pentanucleotide repeat polymorphism (51-72 copies) found within the 
promoter region of the human inducible nitric oxide synthase 2A (NOS2A; 
17q11-q12) gene is also polymorphic in chimpanzees and gorillas but is 
monomorphic in orangutans and macaques (Xu et al., 1997); its influence, if any, 
on promoter function is not known. 
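
Repeat tracts of the (TCAT)n or (CCTTT)n type can be located and their copy 
number counted with a regular expression. The following Python sketch is 
illustrative only; the function name longest_repeat_run and the toy sequence are 
assumptions made for this example, and only perfect, uninterrupted repeats are 
counted. 

```python
import re
from typing import Optional, Tuple


def longest_repeat_run(seq: str, unit: str,
                       min_copies: int = 2) -> Optional[Tuple[int, int]]:
    """Return (start, copies) for the longest perfect tandem run of `unit`.

    Returns None if no run of at least `min_copies` copies is present.
    """
    pattern = re.compile(f"(?:{re.escape(unit.upper())}){{{min_copies},}}")
    best = max(pattern.finditer(seq.upper()),
               key=lambda m: len(m.group()), default=None)
    if best is None:
        return None
    return best.start(), len(best.group()) // len(unit)


# Hypothetical toy sequence: a 7-copy TCAT tract and a shorter 2-copy tract.
toy = "GGA" + "TCAT" * 7 + "CCG" + "TCATTCAT" + "AA"
print(longest_repeat_run(toy, "TCAT"))
```

Alleles of a length polymorphism would differ in the copy number returned for 
the same locus; interrupted or imperfect repeats would need a more tolerant 
search than the exact match used here. 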


5.1.13 mRNA editing 


The existence of mRNA editing represents something of an evolutionary puzzle 
since its selective advantage is not immediately obvious (Covello and Gray, 1993). 
The best understood example in humans is that involving the apolipoprotein B 
mRNA. ApoB-100, produced in the liver, is essential for the production of very 
low density lipoprotein whereas ApoB-48 is required for fat absorption in the 
small intestine. Both proteins are encoded by the same (APOB; 2p23-p24) gene. 
ApoB-48 mRNA is generated as a result of the introduction of an in-frame trans- 
lational termination codon at the mRNA level by the deamination of cytosine to 
uracil (C6666T) in the first base of the CAA codon encoding Gln2153. A con- 
served 29 nucleotide element flanking the edited base (6662-6690) is found in 
mammals; this includes a regulator region, a spacer and a mooring sequence 
which is required for the mRNA editing process. By contrast to the situation 
found in mammals, chicken Apob mRNA is not edited although the various tis- 
sue-specific factors that serve to mediate the modification in mammals are pre- 
sent (Teng and Davidson, 1992). The absence of mRNA editing in chicken is 
thought to be due to the presence of several single base-pair substitutions in the 
mooring region of the chicken Apob mRNA since the experimental introduction 
of A6671T, G6674T and C6680T substitutions into the chicken gene served to 
confer mRNA editing ability upon chicken cells (Nakamuta et al., 1999). 
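
The consequence of the editing event described above, a C to U change in the 
first base of a CAA codon, can be shown in a few lines of Python. This is a 
minimal sketch: the codon table contains only the codons needed for the example, 
and the nine-nucleotide toy message and the function names are assumptions 
introduced here, not part of the APOB sequence. 

```python
# Minimal codon lookup: only the codons used in this example are included.
CODONS = {"AUG": "Met", "CAA": "Gln", "GGU": "Gly", "UAA": "Stop"}


def edit_c_to_u(mrna: str, index: int) -> str:
    """Deaminate the cytosine at 0-based `index` to uracil."""
    if mrna[index] != "C":
        raise ValueError("editing site is not a cytosine")
    return mrna[:index] + "U" + mrna[index + 1:]


def translate(mrna: str) -> list:
    """Translate codon by codon, stopping at the first termination codon."""
    peptide = []
    for i in range(0, len(mrna) - 2, 3):
        residue = CODONS[mrna[i:i + 3]]
        if residue == "Stop":
            break
        peptide.append(residue)
    return peptide


unedited = "AUGCAAGGU"                    # toy message containing a CAA (Gln) codon
edited = edit_c_to_u(unedited, index=3)   # first base of the CAA codon

print(unedited, translate(unedited))
print(edited, translate(edited))
```

In the toy message, editing converts the Gln codon into UAA and translation 
terminates immediately before it, which is the same mechanism by which the 
edited APOB transcript yields the shorter ApoB-48 product. 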

The editing of the APOB mRNA is performed by a multi-protein complex 
(‘editosome’) whose catalytic component has been termed apobec-1. The 
APOBEC1 gene, localized to chromosome 12p13.1 in human, is not highly 
conserved when compared with its homologues in other mammals, consistent with its 
recent rapid evolution (Chan et al., 1997; Fujino et al., 1998). Apobec-1 shows sub- 
stantial sequence homology to cytidine/cytidylate deaminases (Chan et al., 1997) 
and it would thus appear that a protein with a nucleoside as substrate has evolved 
from a protein with a nucleotide as a substrate. It is unlikely, however, that 
apobec-1 evolved simply by gene duplication and divergence since mRNA editing 
requires multiple factors which would have had to have coevolved in order to 
function as a cohesive complex. Even though the separation of apobec-1 and the 
cytidine/cytidylate deaminases is ancient (Chan et al., 1997), apobec-1 is only 
found in mammals suggesting that the ancestral apobec-1 protein might have had 
a function other than mRNA editing. 

mRNA editing is not however unique to APOB mRNA. Different types of 
mRNA editing have also been found in several other human mRNAs viz. a T→C 
transition in the Wilms’ tumor (WT1; 11p13) mRNA (Sharma et al., 1994), a 
T→A transversion in the α-galactosidase (GLA; Xq21.3-q22) mRNA (Novo et al., 
1995) and a C→T transition in the neurofibromatosis type 1 (NF1; 17q11.2) 
mRNA (Skuse et al., 1996). 


5.1.14 Coordinate regulation 


The clustering of certain genes may be important for their coordinate regulation 
by common control elements. Possible examples of this include the genes 
encoding the spermatid-specific nucleoprotamines (PRM1 and PRM2; 16p13.2), the 
platelet membrane glycoproteins (ITGA2B and ITGB3; 17q21-q22), the albumin 
family (ALB, AFP, AFM, GC; 4q11-q13), the pregnancy-specific glycoproteins 
(PSG1, PSG2, PSG3, PSG4, PSG5, PSG6, PSG7, PSG8, PSG11, PSG12, 
PSG13; 19q13.2) and the fibrinogens α, β, and γ (FGA, FGB and FGG; 4q31) 
(Chapter 8.5). In some cases, a head-to-head arrangement may potentiate the use 
of bidirectional promoters (Section 5.1.5). 

Coordinate regulation does not however require close linkage as evidenced by 
the case of the α- (HBA1) and β-globin (HBB) genes on chromosomes 16p13.3 
and 11p15.5 respectively. Another example is that of the human ribosomal protein 
genes which encode the 80-90 ribosomal proteins that together constitute the 
ribosome. The expression of the different ribosomal protein genes, which can 
account for 7-9% of total cellular RNA, is coordinately regulated in response to 
the cell’s varying requirements for protein synthesis. However, the ribosomal pro- 
tein genes are highly dispersed in the human genome (Feo et al., 1992; Kenmochi 
et al., 1998; Table 5.2) indicating that their coordinate regulation must be brought 
about by the action of trans-acting factors rather than the influence of shared reg- 
ulatory sequences. 


5.1.15 Changes in expression of developmentally significant gene 
products 


If we possessed a thorough knowledge of all parts of the seed of any animal 
(e.g. man), we could from that alone, by reasons entirely mathematical and 
certain, deduce the whole conformation and figure of each of its members, 
and, conversely if we knew several peculiarities of this conformation, we 
would from those deduce the nature of its seed. 

René Descartes Oeuvres iv, 494 


The study of the molecular genetics of vertebrate development is still very much 
in its infancy. However, the identification of developmental control genes impor- 
tant in morphogenesis is proceeding apace as a result of studies of (i) model organ- 
isms including mouse, zebrafish, Drosophila and Caenorhabditis elegans 
(Postlethwait and Talbot, 1997) and (ii) human dysmorphic syndromes and con- 
genital malformations (Epstein, 1995; Kondo et al., 1998; Semenza, 1998). 

Table 5.2. Chromosomal locations of chromosomally assigned human ribosomal protein genes

Ribosomal protein gene | Gene symbol | Chromosomal location
Ribosomal protein L3 | RPL3 |
Ribosomal protein L4 | RPL4 |
Ribosomal protein L5 | RPL5 |
Ribosomal protein L6 | RPL6 |
Ribosomal protein L7 | RPL7 |
Ribosomal protein L8 | RPL8 |
Ribosomal protein L9 | RPL9 |
Ribosomal protein L10 | RPL10 |
Ribosomal protein L11 | RPL11 |
Ribosomal protein L13 | RPL13 |
Ribosomal protein L15 | RPL15 |
Ribosomal protein L17 | RPL17 |
Ribosomal protein L19 | RPL19 |
Ribosomal protein L22 | RPL22 |
Ribosomal protein L23A | RPL23A |
Ribosomal protein L24 | RPL24 |
Ribosomal protein L27 | RPL27 |
Ribosomal protein L27A | RPL27A |
Ribosomal protein L28 | RPL28 |
Ribosomal protein L29 | RPL29 |
Ribosomal protein L30 | RPL30 |
Ribosomal protein L31 | RPL31 |
Ribosomal protein L32 | RPL32 |
Ribosomal protein L35A | RPL35A |
Ribosomal protein L36A | RPL36A |
Ribosomal protein L37 | RPL37 |
Ribosomal protein L38 | RPL38 |
Ribosomal protein L41 | RPL41 |
Ribosomal protein S2 | RPS2 |
Ribosomal protein S3 | RPS3 |
Ribosomal protein S3A | RPS3A |
Ribosomal protein S4 | RPS4X/Y |
Ribosomal protein S5 | RPS5 |
Ribosomal protein S6 | RPS6 |
Ribosomal protein S7 | RPS7 |
Ribosomal protein S8 | RPS8 |
Ribosomal protein S9 | RPS9 |
Ribosomal protein S10 | RPS10 |
Ribosomal protein S11 | RPS11 |
Ribosomal protein S12 | RPS12 |
Ribosomal protein S13 | RPS13 |
Ribosomal protein S14 | RPS14 |
Ribosomal protein S15A | RPS15A |
Ribosomal protein S17 | RPS17 |
Ribosomal protein S18 | RPS18 |
Ribosomal protein S24 | RPS24 |
Ribosomal protein S25 | RPS25 |

The spatial and temporal distribution patterns of expression of HOX genes 
have played an important role in the evolutionary emergence of novel body plans 
among the metazoa (Belting et al., 1998; Gellon and McGinnis, 1998; 
Chapter 4.2.1, Homeobox genes). The molecular basis of such a role has long been a 
puzzle but new studies are now providing a glimpse of how subtle genetic changes 
can have fairly dramatic effects on morphology. One of the best characterized 
examples is provided by a promoter mutation which appears to have morpholog- 
ical consequences for the evolution of the mammalian body plan: a 4 bp deletion 
in element C of the early enhancer region of the Hoxc8 gene of baleen whales 
(Shashikant et al., 1998). This lesion (TTAATTG-TT-G) is specific to baleen 
whales (five species tested) and is not found in the highly conserved Hoxc8 gene 
promoters of humans (HOXC8; 12q12-q13), rodents, artiodactyls or the toothed 
whales including sperm whales (Shashikant et al., 1998). In mice, the early 
enhancer region of the Hoxc8 gene promoter is required to initiate expression of 
the gene in the posterior region of the day 8.5 mouse embryo and to establish spa- 
tial domains of expression in the neural tube and mesoderm. After day 9.0, the 
late enhancer maintains anterior Hoxc8 gene expression and down-regulates pos- 
terior expression. The baleen whale Hoxc8 early enhancer region containing the 4 
bp deletion has been assayed in transgenic mouse embryos where it was found to 
direct the expression of the reporter gene to more posterior regions (4-5 somite 
levels posterior as compared with human or murine Hoxc8 enhancers) of the 
neural tube but failed to direct expression to the posterior mesoderm (Shashikant 
et al., 1998). Similar results were obtained when site-directed mutagenesis was 
used to introduce the same lesion into the murine Hoxc8 early enhancer region. 
Thus the 4 bp deletion in the Hoxc8 gene promoter of baleen whales may have 
played a role in modifying the developmental program of these cetaceans. One 
wider implication of this work is that there may be additional mutations yet to 
discover in the cis-acting regulatory sequences of other Hox genes and these 
lesions could have contributed to the evolution of body plan diversity during 
mammalian evolution. 

It is anticipated that mutational changes in a number of other genes encoding 
transcription factors that play a role in embryonic development will be character- 
ized in the coming years thereby shedding new light on the molecular basis of 
morphogenesis. Possible candidate genes would include the Pax gene family 
(Chapter 4.1.6; Balczarek et al., 1997; Noll 1993), the Sox gene family (Wegner, 
1999), the engrailed (EN1, 2q13-q21; EN2, 7q36) and Wingless (WNT1; 12q12- 
q13) genes (Joyner 1996), the brachyury (T; 6q27) gene (Yasuo and Satoh, 1998) 
and the snail family of transcription factors (Sefton et al., 1998) as well as the heat 
shock protein 90 (HSP90) family of signal transduction chaperonins (Rutherford 
and Lindquist, 1998). Sequence differences between orthologous developmental 
regulatory genes should provide insights into the process of molecular evolution 
underlying morphological change (Budd, 1999; Eizinger et al., 1999). 


5.2 Transcription factors 


You geneticists may know something about the hereditary mechanisms that 
distinguish a red-eyed from a white-eyed fruit fly but you haven’t the slightest 
inkling about the hereditary mechanism that distinguishes fruit flies from ele- 
phants. 

W.J. Osterhout (1925) 


5.2.1 Transcription factor families 


Many mammalian transcription factors belong to families whose members bind 
to very similar or identical DNA sequence motifs. Thus, there are at least eight 
cAMP-responsive transcription factors of the CREB/ATF family that bind to the 
octanucleotide TGACGTCA (Hai et al., 1989). The evolutionary conservation of 
DNA sequence recognition can be fairly dramatic as for example in the case of the 
various members of the Brn-3 class of POU domain transcription factors found in 
both mammals and nematodes (Gruber et al., 1997). On the other hand, some 
transcription factor families possess members that exhibit a considerable degree 
of divergence in terms of their DNA binding specificities. For example, at least 
four members of the nuclear factor 1 (NF1) family recognize sequences contain- 
ing the trinucleotide TGG (Gil et al., 1988) and at least eight members of the 
mammalian Ets family bind to an 11 bp purine-rich motif containing a conserved 
GGA core (Wang et al., 1992). The DNA binding specificity of different Ets fam- 
ily members is determined by the nucleotides at the 3’ end of the Ets-binding site 
(Wang et al., 1992). 
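
The idea that family members share a recognition motif can be illustrated with a minimal sketch that scans a sequence for a motif on both strands. The octanucleotide TGACGTCA and the GGA core are taken from the text; the function names and the toy promoter sequence are hypothetical and serve only as an illustration, not as a description of how binding sites were identified in the cited studies.

```python
# Minimal sketch: exact-match motif scanning on both DNA strands.
# The toy promoter below is invented for illustration only.

def reverse_complement(seq):
    """Return the reverse complement of an uppercase/lowercase DNA string."""
    comp = {"A": "T", "C": "G", "G": "C", "T": "A"}
    return "".join(comp[base] for base in reversed(seq.upper()))


def find_motif(seq, motif):
    """Return sorted (0-based start, strand) pairs for exact motif matches.

    A '-' hit means the reverse complement of the motif occurs at that
    position on the forward strand.
    """
    seq = seq.upper()
    motif = motif.upper()
    hits = set()
    for strand, pattern in (("+", motif), ("-", reverse_complement(motif))):
        start = seq.find(pattern)
        while start != -1:
            hits.add((start, strand))
            start = seq.find(pattern, start + 1)
    return sorted(hits)


CRE = "TGACGTCA"       # CREB/ATF octanucleotide (from the text)
ETS_CORE = "GGA"       # conserved core of the Ets-binding site (from the text)
toy_promoter = "GGCTGACGTCAGAGCCGGAAGTGCC"  # hypothetical sequence

cre_hits = find_motif(toy_promoter, CRE)
ets_hits = find_motif(toy_promoter, ETS_CORE)
```

Because TGACGTCA is its own reverse complement, a single occurrence is reported on both strands. Exact matching is a simplification; real binding sites tolerate some sequence variation.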

The evolutionary subdivisions of transcription factor families as revealed by 
phylogenetic analysis may be paralleled by functional subdivisions (Elsen et al., 
1995). If function is conserved within (although not between) subfamilies, then 
the functions of novel transcription factors may be to some extent predictable by 
comparison with other members of the same subfamily. One example of this is 
provided by the high mobility group (HMG) protein superfamily of DNA-bind- 
ing proteins. These proteins possess one or more copies of an 80 amino acid 
domain termed the HMG box and have an evolutionary history that dates back 
1000 Myrs (Laudet et al., 1993). The HMG superfamily may be divided into two 
sub-families (i) the TCF/SOX family which comprises transcription factor pro- 
teins that contain a single sequence-specific HMG box and (ii) the UBF/HMG 
family of chromatin structure regulatory proteins which possess multiple HMG 
boxes that exhibit little if any sequence specificity. Representatives of the first 
family in the human genome include the chromosomally dispersed SRY-related 
HMG-box (SOX) genes (SOX1, 13q34; SOX2, 3q26-q27; SOX3, Xq26-q27; 
SOX4, 6p23; SOX5, 12p12; SOX9, 17q23; SOX10, 22q13; SOX11, 2p25; SOX20, 
17p13; SOX22, 20p13; Wegner 1999), the lymphoid enhancer-binding factor 1 
(LEF1; 4q23-q25) and the hepatocyte nuclear factor 1α (TCF1; 12q24) genes. 
Human representatives of the second family include the high mobility group 
(non-histone chromosomal) protein genes, HMG1 (13q12), HMG2 (4q31), 
HMG4 (Xq28), HMG14 (21q22) and HMG17 (1p35-p36) and upstream binding 
transcription factor (UBTF; 17q21). The diversity of function exhibited by mem- 
bers of the HMG superfamily is a result of the action of a number of different evo- 
lutionary processes including gene duplication, intragenic duplication, exon 
shuffling and single base-pair substitution mediated divergence (Laudet et al., 
1993). 

Duplications or amplifications of transcription factor genes are sometimes very 
ancient as is evidenced by the limited homology still evident between nuclear fac- 
tor 1 (NF1) and the protein kinase family (Mannermaa and Oikarinen, 1989). On 
the other hand, some duplications have occurred during mammalian evolution 
[e.g. the transcription factors TCF1 (12q24) and LEF1 (4q23-q25) (Gastrop et al., 
1992) and the hedgehog gene family, sonic (SHH; 7q36) and Indian (IHH; 2q33- 
q35) (Zardoya et al., 1996)] and some as recently as 55 Myrs ago in the common 
ancestor of the simians e.g. the ZNF91 family (ZNF91; 19p12-p13.1; Bellefroid et 
al., 1995). 


5.2.2 Functional conservation of orthologous transcription factors 


The general transcription factor, TBP (TFIID), binds to the upstream TATA box 
element and is essential for transcription in all eukaryotes. TBP is highly con- 
served throughout the eukaryotes and this conservation even extends to the 
archaebacteria (Hoffman et al., 1990; Rowlands et al., 1994). Functional conserva- 
tion at the protein level is evidenced by the ability of yeast TBP to replace human 
TBP in in vitro transcription reactions and vice versa (Buratowski et al., 1988). Not 
surprisingly, therefore, yeast and human TBP have virtually identical sequence 
recognition characteristics, at least in vitro (Wobbe and Struhl, 1990). However, in 
vivo, human TBP cannot properly substitute for yeast TBP and the yeast cells in 
question grow extremely poorly (Cormack et al., 1991; Gill and Tjian, 1991). This 
difference between the in vitro and in vivo situations is salutary and probably 
reflects subtle differences in the interactions of TBP with activator proteins or 
other components of the transcriptional initiation complex. 

Interactions between cis-acting DNA sequence motifs and the trans-acting tran- 
scription factors binding them may thus be conserved over fairly long periods of 
evolutionary time. Another example is that of the rat pituitary-specific transcrip- 
tion factor Pit-1 which is able to bind to and activate the growth hormone gene 
promoter of the rainbow trout (Argenton et al., 1993). Similarly, mammalian ets 
genes are functionally homologous to the pointed gene of Drosophila (Albagli et al., 
1996). Functional conservation is also evident with other factors such as c-jun, the 
serum response factor SRF, the CCAAT box-binding factor CP1 and the gluco- 
corticoid and estrogen receptors; the mammalian proteins can substitute for their 
yeast counterparts in order to activate gene transcription in yeast (Guarente and 
Bermingham-McDonogh, 1992). This functional conservation owes its existence 
to the fact that once specificity has been established between a given transcrip- 
tional activator and its cognate binding sites in the regulatory regions of different 
genes, both the structure of the DNA-binding domain of the activator and the 
sequence of the recognition site are evolutionarily constrained. Evolutionary 
change between both paralogous and orthologous transcription factors has never- 
theless occurred in a number of different ways and some of these are illustrated in 
the following sections. 


5.2.3 Functional redundancy of paralogous transcription factors 


Gene duplication and amplification initially generate functional redundancy as in 
the case of the c-Ets-1 (ETS1; 11q23) and c-Ets-2 (ETS2; 21q22) genes in mam- 
mals where the former appears to be dispensable and indeed replaceable by the lat- 
ter (Albagli et al., 1994). Functional redundancy is also apparent in the MyoD 
family of transcription factors, for example between MYOD1 (11p15.1) and 
MYF5 (12q21), and between MYOD1 (11p15.1) and MYF6 (12q21) (Atchley et al., 
1994). It has also been noted that the transgenic inactivation of several genes 
known to be important in mammalian development does not automatically lead 
to a deleterious phenotype, implying a degree of functional redundancy. We may 
surmise that developmental gene redundancy will have endured if its mainte- 
nance has served to confer a selective advantage (Cooke et al., 1997). On the other 
hand, functional redundancy could also have originated merely as a consequence 
of certain genes being either especially prone to duplication (Iwabe et al., 1996) or 
manifesting an increased probability of survival after duplication (Gibson and 
Spring, 1998). 

Functional redundancy does not however always endure and the relaxation of 
selective pressure consequent to gene duplication/amplification potentiates the 
diversification of paralogous transcription factors leading eventually to the emer- 
gence of families of related transcription factors with different DNA target 
sequence specificities and hence distinct functions. 


5.2.4 Paralogous transcription factors 


Perhaps the best characterized example of the divergence of paralogous transcrip- 
tion factors is provided by the nuclear receptor family (Chapter 4, section 4.2.3, 
Nuclear receptor genes). In this family, dimerized receptors typified by the estrogen 
receptor group, bind to two 6 bp half sites of the sequence TGACCT whereas those 
of the glucocorticoid receptor group recognise the related sequence TGTTCT 
(Chapter 7, section 7.5.2). The amino acids involved in discriminating between 
these motifs are located in the ‘P-box’ of the DNA recognition helix. The P-box of 
the glucocorticoid receptor group appears to have evolved from a progenitor 
resembling the present-day estrogen receptor (Amero et al., 1992; Laudet et al., 
1992; Martinez et al., 1991; see Figure 5.9). Thus, mutations in the P-box have 
altered the DNA-binding specificity of a receptor with high affinity for TGACCT 
sites and low affinity for TGTTCT sequences, to proteins with the opposite speci- 
ficity. Residues Glu439 and Ser440 are critical for conferring binding specificity on 
the receptor (Zilliacus et al., 1994). 


Receptor group | P-box
THR | EGCKG
RAR | EGCKG
HRA | EGCKG
COUP | EGCKS
RXR | EGCKG
HNF4/TLL | DGCKG
ER | EGCKA
GR | GSCKV
KNI | EGCKS

Figure 5.9. P-box sequences of different receptor groups within the nuclear receptor 
family (redrawn from Zilliacus et al., 1994). 
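
As a small illustration of how closely related the two half sites are, the following sketch compares the hexamers TGACCT and TGTTCT quoted above and assigns a hexamer to a receptor group by exact match. The dictionary and function names are hypothetical, and exact matching is an assumption made for brevity; real receptors bind a range of related sequences with differing affinities.

```python
# Minimal sketch: compare the two nuclear receptor half sites given in the text.
# Names are illustrative; exact-match classification is a simplifying assumption.

HALF_SITES = {
    "TGACCT": "estrogen receptor group",
    "TGTTCT": "glucocorticoid receptor group",
}


def mismatches(a, b):
    """Count positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return sum(x != y for x, y in zip(a.upper(), b.upper()))


def classify_half_site(hexamer):
    """Return the receptor group whose half site matches exactly, if any."""
    hexamer = hexamer.upper()
    if len(hexamer) != 6:
        raise ValueError("a half site is 6 bp long")
    return HALF_SITES.get(hexamer, "unclassified")


# The two half sites differ only at the two central positions.
n_differences = mismatches("TGACCT", "TGTTCT")
```

The comparison makes the point of the paragraph concrete: a change in specificity between the two receptor groups corresponds to discrimination between motifs that differ at only two of six positions.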

Another example of a functional change in the DNA-binding domain within a 
family of paralogous transcription factors is provided by Pax6 (PAX6; 11p13) 
which possesses an Asn at amino acid residue 47 in the third α-helix of the paired 
domain (Balczarek et al., 1997); this amino acid recognizes the nucleotide T. All 
other known Pax genes encode proteins with a His at this position; this amino 
acid shows higher affinity toward the nucleotide G. 


5.2.5 Orthologous transcription factors 


Orthologous transcription factors have also diverged over evolutionary time. 
Divergence has occurred by, for example, incorporation of novel motifs or the 
amplification of existing motifs. Thus, the human transcription factor MOK2 
(MOK2; 19q13.2-q13.3) contains 10 zinc-finger motifs in comparison to seven in 
its murine homologue (Ernoult-Lange et al., 1995). Similarly, the human ery- 
throid-specific transcription factor Eryth 1 contains different numbers of repeat 
motifs as compared with its chicken counterpart (Trainor et al., 1990). Finally, gene 
sequences encoding TBP, the general transcription factor, exhibit a considerable 
degree of sequence simplicity as a direct result of simple repeat amplification, per- 
haps by replication slippage (Hancock, 1993). The incorporation of new repeats 
and the consequent enlargement of TBP may have permitted novel interactions 
with domains of other proteins leading to the acquisition of new functions. 


5.2.6 Alternative splicing of transcription factor genes 


Alternative splicing provides the means to generate transcription factor diversity 
in the absence of gene duplication. Thus, alternative splicing of the PAX8 (2q12- 
q14) gene results in the alternative presence or absence of a single Ser residue in 
the recognition helix of the paired domain which is critical for DNA binding 
(Kozmik et al., 1997). The two forms of Pax8 differ in their binding specificity. 


5.2.7 Promoter shuffling in transcription factor genes 


Promoter modularity arising from the shuffling of component motifs (Section 
5.1) often occurs in the promoters of paralogous transcription factor genes. This 
provides the means for changes in the expression of single genes to lead to 
changes in the expression of many downstream target genes, a process which has 
contributed significantly to the evolution of complex gene expression networks. 


5.2.8 Transcription factor-binding site interactions 


DNA sequence elements that play a role in gene regulation have evolved so as to 
provide appropriate binding sites for their cognate transcription factors. In many 
cases, element function (measured in terms of influence on transcription) is 
directly proportional to the probability that a specific site is bound by the tran- 
scription factor protein. It follows that the same function can be achieved with a 
strong binding site and a small amount of protein as with a weak binding site and 
a large amount of protein. The extent of selective pressure on the DNA sequence 
is therefore likely to be determined by a combination of the cellular abundance of 
the protein, the functional activity of the protein and the initial binding strength 
of the DNA-protein interaction (Berg, 1992). Transcription factor binding may 
however also be influenced by the number of binding sites available on the pro- 
moter (Section 5.1.11). Thus, the pituitary-specific growth hormone (GH1; 
17q22-q24) gene of humans and rats contains two binding sites for the transcrip- 
tion factor Pit-1 whereas four such sequences occur in the promoter of the homol- 
ogous Gh1 gene of Oncorhynchus mykiss, the rainbow trout (Argenton et al., 1993). 
Rat Pit-1 has been shown to be capable of binding to three regions of the trout 
gene promoter thereby driving expression of a downstream reporter gene 
(Argenton et al., 1993). 
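
The trade-off between binding strength and protein abundance described above can be sketched numerically. The example below assumes a simple single-site equilibrium model in which the probability of occupancy is [P] / ([P] + Kd); this model, the function name and the concentration values are assumptions introduced for illustration and are not taken from Berg (1992).

```python
# Minimal sketch, assuming simple single-site equilibrium binding:
#   occupancy = [P] / ([P] + Kd)
# All numerical values are arbitrary illustrative units.

def occupancy(protein_conc, kd):
    """Probability that a site is bound, given free protein concentration and Kd."""
    if protein_conc < 0 or kd <= 0:
        raise ValueError("protein_conc must be >= 0 and kd must be > 0")
    return protein_conc / (protein_conc + kd)


# Strong site (low Kd) with a small amount of protein ...
strong_site = occupancy(protein_conc=1.0, kd=1.0)
# ... versus a weak site (high Kd) with a large amount of protein.
weak_site = occupancy(protein_conc=100.0, kd=100.0)
```

Under this assumed model, occupancy depends only on the ratio of protein concentration to Kd, so a hundred-fold weaker site is compensated exactly by a hundred-fold more protein.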


5.2.9 Exon shuffling in the evolution of transcription factors 


Transcription factors are sometimes encoded by genes that are evolutionarily 
unrelated yet share the same type of DNA-binding domain. Such genes may have 
arisen by exon shuffling (Chapter 3, section 3.6), the process by which functional 
domains encoded by one or more exons have been dispersed to a variety of differ- 
ent proteins. That some DNA-binding domains are encoded by several exons 
implies that these exons must have been shuffled together as a single block. 
Matsuo et al. (1994) have suggested that the presence of short unconserved introns 
with different types of splice junction within the mammalian Oct-2 (POU2F2; 
chromosome 19) gene may have served to prevent recombination between the 
exons comprising the conserved POU domain without inhibiting the shuffling of 
this domain in its entirety between different transcription factor-encoding genes 
during evolution. 

The divergence of the mammalian T-box family of transcription factors, which 
began before the separation of the vertebrate, arthropod and nematode lineages, 
has occurred both by the insertion or deletion of specific introns, or by intron slid- 
ing (Chapter 3, section 3.4) leading to variations in exon length (Wattler et al., 
1998). 
6 


Pseudogenes and their 
formation 


6.1 Pseudogene formation 


Pseudogenes may be regarded as ‘floating hulks,’ gene-derived DNA sequences 
that are no longer capable of being expressed as protein products. Some are the 
remnants of once active genes that have acquired inactivating mutations which 
either preclude their transcription or at the very least prevent mRNA translation. 
Others are retrotransposed copies of expressed mRNAs which, since they almost 
always lack promoter sequences, are incapable of being expressed and therefore 
also tend to accumulate deleterious mutations. Pseudogenes ought not, however, 
to be regarded entirely as evolutionary cul-de-sacs. Although it is unlikely that 
reactivating (reverse) mutations will occur so as to restore their function, pseudo- 
genes may nevertheless influence the evolution of other functionally significant 
sequences by for example mediating recombination events (Beck et al., 1996; 
Takahashi et al., 1982) or acting as sequence donors in gene conversion (Section 
6.1.6). 


6.1.1 The generation of pseudogenes by duplication 


A considerable number of pseudogenes have been described in the human 
genome that have retained the exon-intron structure of their functional source 
genes (Figure 6.1). A selection of the known pseudogenes of this type is presented 
in Table 6.1. One assumes that such sequences arose by simple duplication of func- 
tional gene sequences but became inactivated since their intrinsic redundancy 
prevented selection from maintaining their potential to be expressed (Wilde, 
1985). Many of the examples listed in Table 6.1 have therefore accumulated non- 
sense mutations, frameshift deletions and insertions, or single base-pair substitu- 
tions within splice sites, any one of which would have been sufficient to render 
the expression of the sequences impossible. Perhaps the archetypal example is 
that of the human pseudogene (CYP21P) for cytochrome P450c21 which is 
closely linked to its cognate gene (CYP21; 6p21.3) as a result of a duplication 
event which occurred before the separation of apes from Old World monkeys > 23 
Myrs ago (Horiuchi et al., 1993). This pseudogene contains three inactivating 
mutations: an 8 bp deletion in exon 3, the insertion of a T in exon 7 and a non- 
sense mutation in exon 8 (Kawaguchi et al., 1992). The 8 bp deletion is present in 
both humans and chimpanzees whereas the other two mutations are human-spe- 
cific (Kawaguchi et al., 1992) consistent with the 8 bp deletion being the original 
inactivating mutation in the human-chimpanzee lineage. In gorilla and orang- 
utan, the extra Cyp21 gene copies are inactivated by various other mutations 
(Kawaguchi et al., 1992). 


Figure 6.1. Schematic diagram depicting the different mechanisms of pseudogene 
formation. 


Table 6.1. Examples of human pseudogenes (ψ) generated by duplication 

Pseudogene | Complete/partial | Comments | Reference
Cytochrome P450c21 | Complete | Linked to CYP21 (6p21.3) gene | Harada et al. (1987)
Amylase | Partial (lacks exons 1-3) | Linked to AMY gene cluster (1p21) | Groot et al. (1990)
Apolipoprotein C1 | Complete | Linked to APOC1 (19q13.2) gene | Raisonnier (1991)
Tyrosinase | Partial (exons 4 and 5 and non-coding) | 11p11.2-cen. Unlinked to TYR gene (11q14-q21). Not transcribed. | Giebel et al. (1991)
Dopamine D5 receptor (2) | Complete (intronless ψ derived from intronless gene) | ψ on chromosomes 1 and 2. Unlinked to DRD5 gene (4p15-p16). | Marchese et al. (1995)
Olfactory receptor | Complete | Linked to olfactory receptor gene cluster on chromosome 17. Transcribed. | Crowe et al. (1996)
Histone H2B | Complete | Linked to other histone genes (1q21-q23). | Albig et al. (1997)
Keratin K14 | Complete | Unclear if linked to KRT14 gene at 17q12-q21 | Savtchenko et al. (1988)
Fibroblast growth factor (16?) | Partial (exon 2, intron 2, exon 3 and 3' UTR) | Transcribed. Dispersed. Unlinked to FGF7 gene (15q15-q21). | Kelley et al. (1992)
XG blood group | Partial (exons 1, 2A, 2B, 3) | Yq11.21. XG gene maps to pseudoautosomal boundary region, Xp22-pter. Transcribed. | Weller et al. (1995)
Interleukin-8 receptor | Complete | 2q34-q35. Linked to IL8RA and IL8RB genes. | Ahuja et al. (1992)
Kallmann syndrome gene | Complete | Yq11. Unlinked to KAL1 gene (Xp22.3). | del Castillo et al. (1992)
α2-Macroglobulin | Complete | 12p12-p13. Linked to A2M gene. | Devriendt et al. (1989)
von Willebrand factor | Partial (exons 23-28) | Chromosome 22. Unlinked to VWF gene (12p13) | Marchetti et al. (1991)
Mannose-binding protein | Complete | 10q22. Probably linked to MBP gene (10q11-q21) | Guo et al. (1998)
Serum amyloid A (SAA3) | Complete | Not transcribed. Linked to SAA gene cluster (11p15). | Kluve-Beckerman et al. (1991)
Carbonic anhydrase V | Partial (exons 3-7, introns 3-6) | 16p11.2-p12. Unlinked to CA5 gene (16q24.3). | Nagao et al. (1995)
Calcitonin | Partial (exons 2 and 3) | Linked to CALCA gene (11p15). | Lips et al. (1989)
Protein S | Partial (exons 2-15 present) | Linked to PROS1 gene (3p11-q11). Not transcribed. | Ploos van Amstel et al. (1990)
β-Glucocerebrosidase | Complete | Linked to GBA gene (1q21). Transcribed. | Horowitz et al. (1989)
Metaxin | Complete | Linked to MTX gene (1q21). Not transcribed. | Long et al. (1996)
β-Glucuronidase (2) | Complete and partial | 5p13 and 5q13. Unlinked to GUSB gene (7q11). | Speleman et al. (1996)
α-Globin | Complete | Linked to HBA1 (16p13.3-pter) gene. Transcribed. | Whitelaw and Proudfoot (1983)
ζ-Globin | Complete | Linked to HBA1 (16p13.3-pter) gene. Not transcribed. | Proudfoot et al. (1984)
Immunoglobulin Cε2 | Partial | Linked to immunoglobulin Cε1 IGHE gene (14q32). | Hisajima et al. (1983)
Immunoglobulin γ | Complete | Linked to IGHG gene cluster (14q32). | Takahashi et al. (1982)
Immunoglobulin Vκ (3) | Complete | Chromosomes 1, 15, 22. Unlinked to IGKV gene cluster (2p12) | Bentley and Rabbitts (1980); Lotscher et al. (1986)
Immunoglobulin VH | Complete | Linked to immunoglobulin VH (IGHV) gene cluster (14q32.3) | Cook and Tomlinson (1995)
HLA class I (9) | Complete | Linked to HLA-A, HLA-E, HLA-G and HLA-F genes (6p21.3) | Hughes (1995); Gruen et al. (1996); Geraghty et al. (1992)
HLA class II | Complete | Linked to HLA-DQB1 genes (6p21.3) | Figueroa et al. (1994)
Prostaglandin EP4 receptor (2) | Complete | Unclear if linked to PTGER4 gene (5p13) | Foord et al. (1996)
Adrenoleukodystrophy | Partial (exons 7-10) | Unclear if linked to ALD gene at Xq28. | Braun et al. (1996)
β-Globin | Complete | Linked to HBB gene (11p15) | Chang and Slightom (1984)
Aldehyde dehydrogenase (ALDH8) | Complete | Transcribed | Hsu and Chang (1996); Hsu et al. (1997)
Iduronate-2-sulfatase | Partial (exons 2 and 3, introns 2 and 7) | Linked to IDS gene (Xq28). Not transcribed. | Timms et al. (1995); Rathmann et al. (1995)

Numbers in brackets denote numbers of pseudogenes in cases where more than one has been found. 

From inspection of Table 6.1, it is apparent that such pseudogenes are essen- 
tially of two types: either complete copies of the source gene or partial copies of 
that gene. Complete copies tend to be linked to the source gene (examples given 
in Figure 6.2) whereas partial copies are often (although not always) unlinked to 
the gene from which they originated. This may reflect mechanistic differences in 
pseudogene generation. 


Figure 6.2. Human β-globin, α-globin, and growth hormone gene clusters illustrating 
relative locations of nonprocessed pseudogenes. Genes labelled: β-globin gene cluster 
(11p15.5; ~65 kb): LCR, HBE1, HBG2, HBG1, HBBP1, HBD, HBB. α-Globin gene cluster 
(16p13.3; ~75 kb): LCR, HBZ, HBZP, HBAP2, HBAP1, HBA2, HBA1, HBQ1. Growth hormone 
gene cluster (17q22-q24; ~50 kb): LCR, GH1, CSHP1, CSH1, GH2, CSH2. Key: expressed, 
functional gene; expressed, no known function; pseudogene; arrows, direction of 
transcription; LCR, locus control region. 

Since many of the pseudogenes listed in Table 6.1 are duplicated copies of entire 
genes that include their original promoters, it is not surprising that some are 
capable of being transcribed. However, some of the partial, truncated pseudogenes 
also appear to be transcribed. We may nevertheless assume that the nonsense 
mutations and frameshift insertions and deletions acquired by these sequences 
serve to preclude the translation of any mRNAs synthesized. 
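
The effect of such lesions on translation can be shown with a minimal sketch that translates a toy coding sequence before and after a frameshift deletion and a nonsense substitution. The standard genetic code is used; the toy sequence, the positions mutated and the function names are hypothetical and do not correspond to any pseudogene listed in Table 6.1.

```python
# Minimal sketch: how frameshift and nonsense mutations truncate translation.
# The coding sequence is an invented toy example.
from itertools import product

BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
# Standard genetic code, codons enumerated in TCAG order ('*' marks a stop codon).
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def translate(cds):
    """Translate from the first base until a stop codon or the end of the sequence."""
    protein = []
    for i in range(0, len(cds) - 2, 3):
        amino_acid = CODON_TABLE[cds[i:i + 3]]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


def delete(seq, start, length):
    """Return seq with `length` bases removed beginning at 0-based `start`."""
    return seq[:start] + seq[start + length:]


def shifts_frame(deletion_length):
    """A deletion disrupts the reading frame unless its length is a multiple of 3."""
    return deletion_length % 3 != 0


intact = "ATGGCTGAAACCTGGCTGAAATAA"          # toy open reading frame
frameshifted = delete(intact, 4, 1)            # 1 bp deletion in codon 2
nonsense = intact[:14] + "A" + intact[15:]     # TGG (Trp) -> TGA (stop)

intact_protein = translate(intact)
frameshift_protein = translate(frameshifted)
nonsense_protein = translate(nonsense)
```

In this toy case the intact sequence encodes seven residues, whereas both mutant versions terminate prematurely. By the same arithmetic, an 8 bp deletion, not being a multiple of three, necessarily disrupts the downstream reading frame.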

A particularly good example of the generation of multiple partial pseudogenes 
is that of the numerous sequences related to the neurofibromatosis type 1 (NF1; 
17q11.2) gene identified on human chromosomes 2, 12, 14, 15, 18, 20, 21 and 22 
(Gasparini et al., 1993; Hulsebos et al., 1996; Kehrer-Sawatzki et al., 1997; Legius 
et al., 1992; Marchuk et al., 1992; Purandare et al., 1995; Suzuki et al., 1994). These 
NF1-homologous sequences contain numerous nucleotide substitutions, small 
deletions and insertions (some inactivating) but still exhibit >90% homology to 
the corresponding sequences of the NF1 gene. They appear to have arisen by par- 
tial duplication of the NF1 gene between 22 Myrs and 33 Myrs ago with subse- 
quent rounds of duplication generating new copies which were then transposed to 
the pericentromeric regions of the other chromosomes (Régnier et al., 1997). 

In cases where the parental source gene is intronless [e.g. some argininosucci- 
nate synthetase pseudogenes (Nomiyama et al., 1986) and the dopamine D5 recep- 
tor pseudogene (Marchese et al., 1995)], the pseudogenes may have been 
secondarily created by the genomic duplication of pre-existing processed pseudo- 
genes (see Section 6.1.2 below). 

The occasional human pseudogene manifests a presence/absence polymor- 
phism, for example the T-cell receptor β V6.10 pseudogene which has been found 
in the genome of most but not all individuals tested (Li et al., 1993). 


6.1.2 The generation of pseudogenes by retrotransposition 


The second main category of pseudogenes probably accounts for the majority of 
inactivated gene sequences found in the human genome. A selection of pseudo- 
genes of this type is presented in Table 6.2. They are termed processed pseudogenes 
since they are thought to have originated by retrotransposition of a correctly 
processed mRNA intermediate lacking intervening sequences (Vanin, 1984; 
Weiner, 1986; Figure 6.1). The mRNA origin is evidenced by the lack of introns 
and the frequent presence of poly(A) tracts at their 3’ ends (Vanin, 1984). 
Flanking direct repeats, commonly of 9-17 bp (Vanin, 1984), are usually present 
and have probably been acquired during the process of integration. Many 
processed pseudogenes correspond to the entire length of the coding region of 
their source genes but others are truncated (e.g. CFTR exon 9 pseudogene frag- 
ments; Rozmahel et al., 1997). The vast majority of processed pseudogenes are 
located on different chromosomes from their functional source genes (Table 6.2). 
This is not very surprising in view of the fact that retrotransposition necessarily 
requires a mobile mRNA intermediate. 
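
Two of the hallmarks described above, a 3' poly(A) tract and flanking direct repeats of 9-17 bp, lend themselves to a simple computational check. The sketch below is illustrative only: it assumes that the boundaries of the candidate insert are already known and that the direct repeats are exact and immediately abut the insert, and all sequences and function names are hypothetical.

```python
# Minimal sketch: checking a candidate insert for hallmarks of a processed
# pseudogene. Assumes known insert boundaries and exact flanking repeats.

def poly_a_tail_length(insert):
    """Length of the uninterrupted run of A residues at the 3' end of the insert."""
    insert = insert.upper()
    return len(insert) - len(insert.rstrip("A"))


def flanking_direct_repeat(upstream, downstream, min_len=9, max_len=17):
    """Return the longest exact direct repeat (min_len to max_len bp) formed by
    the 3' end of the upstream flank and the 5' end of the downstream flank,
    or None if no such repeat is present."""
    upstream, downstream = upstream.upper(), downstream.upper()
    longest = min(max_len, len(upstream), len(downstream))
    for size in range(longest, min_len - 1, -1):
        if upstream[-size:] == downstream[:size]:
            return upstream[-size:]
    return None


# Toy locus: flank + repeat | intronless insert + poly(A) | repeat + flank
repeat = "AGTCCATGGCA"                      # hypothetical 11 bp direct repeat
upstream_flank = "TTGACC" + repeat
candidate_insert = "ATGGCTGAAACCTGG" + "A" * 12
downstream_flank = repeat + "GGTTAC"

tail = poly_a_tail_length(candidate_insert)
direct_repeat = flanking_direct_repeat(upstream_flank, downstream_flank)
```

In practice, both features decay with evolutionary time through the accumulation of substitutions, so an approximate rather than exact match would be needed for older pseudogenes.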

Many processed pseudogenes have, like their duplication-derived counterparts, 
acquired multiple genetic lesions that preclude their expression. This is not how- 
ever an invariant finding as evidenced by the human metallothionein processed 
pseudogene which has no such lesions and retains an open reading frame (Karin 
and Richards, 1982). This sequence is nevertheless probably not transcribed 
since, owing to its mRNA origin, it necessarily lacks the promoter elements nec- 
essary for transcription to occur. In general, retrotransposed pseudogenes are 
rarely associated with active promoter elements and are not therefore usually 
expressed. Again there are always exceptions, for example the human glutamine 
synthetase pseudogene which contains a functional promoter/enhancer in its 5’ 
flanking sequence that allows it to be transcribed although not expressed (owing 
to the presence of numerous frameshift mutations) (Chakrabarti et al., 1995). 
Those processed ‘pseudogenes’ which possess both open reading frames and some 
inherent or acquired promoter activity such that they can be expressed, are really 
not pseudogenes at all but rather examples of gene creation by retrotransposition, 
a topic covered in Chapter 9, section 9.6. 

To have been inherited, processed pseudogenes must have originated in the 
germline. It therefore follows that their functional source genes must have been 
expressed in the germline. Consistent with this assumption, many processed 
pseudogenes correspond either to ubiquitously expressed ‘housekeeping’ genes 
(e.g. snRNA genes) or to genes that are known to be expressed in germ cells (e.g. 
β-tubulin; Lewis and Cowan, 1990) (Table 6.2). However, some processed 
pseudogenes are derived from transcripts of tissue-specific genes such as those encoding 
the immunoglobulins (Table 6.2). These pseudogenes may have originated from 
ectopic transcripts, the consequence of the ‘leaky’ transcription which appears to 
be a property of every gene in every cell. In a very few cases, processed pseudo- 
genes have been shown to be derived from antisense transcripts (Rozmahel et al., 
1997; Zhou et al., 1992). 


Table 6.2. Examples of human pseudogenes generated by retrotransposition 


Pseudogene | Chromosomal location/comments | Reference
Dihydrofolate reductase (2) | Unlinked to functional source gene DHFR (5q11-q13) | Chen et al. (1982); Shimada et al. (1983); Anagnou et al. (1985)
ADP-ribosyltransferase, NAD+ (2) | Chromosomes 13 and 14. Unlinked to functional source gene ADPRT (1q42). Present in gorilla (98% homologous). Originated ~27 Myrs ago. | Lyn et al. (1995)
Ribosomal protein L7 | Located within intron 1 of c-fms (CSF1R) gene (5q33) | Sapi et al. (1994)
High mobility group protein | Flanked by 15 bp direct repeat. Originated only ~1 Myrs ago. | Stros and Dixon (1993)
Topoisomerase 1 | Chromosome 1. Truncated. Unlinked to functional source gene TOP1 (20q12-q13) | Zhou et al. (1992)
Glutamine synthetase | Flanked by 9 bp direct repeat. Transcribed. | Chakrabarti et al. (1995)
FAU proto-oncogene | Chromosome 18. Integrated within sequence homologous to promoter of islet amyloid polypeptide (IAPP) gene. 317 amino acid open reading frame. Not transcribed. Unlinked to FAU gene (11q13) | Kas et al. (1995)
28S ribosomal RNA | Flanked by 16 bp direct repeats | Wang et al. (1997)
Ribosomal protein L23A | Chromosome 13. Unlinked to functional source gene RPL23A (17q11) | de Fatima Bonaldo et al. (1996)
Ribosomal protein L38 | Located in promoter region of type-1 angiotensin II receptor gene (AGTR1; 3q21-q25) | Espinosa et al. (1997)
Ferritin L (5) | Unlinked to functional source gene FTL (19q13). | Santoro et al. (1986)
Ferritin H (18) | Highly dispersed (1p, 1q, 2q, 3q, 4, 6p, 13q, 14, 17p, X). Unlinked to functional source gene FTH1 (11q12-q13). | Dugast et al. (1990); Zheng et al. (1995; 1997)
Glyceraldehyde-3-phosphate dehydrogenase (5) | Dispersed including Xp21-p11. Not linked to functional source gene GAPD (12p13). | Arcari et al. (1989)
Cytochrome c oxidase subunit VIb (3) | Unlinked to functional source gene COX6B (19q13) | Taanman et al. (1991)
Cytochrome b5 (2) | Partial. Transcribed. 14q31-q32 and 20p11. Unlinked to functional source gene CYB5 (18q23). | Giordano et al. (1993)
Prothymosin α (5) | Unlinked to functional source gene PTMA (chromosome 2). Not transcribed. | Manrow et al. (1992)
Argininosuccinate synthetase (14) | Highly dispersed. Originated during last 40 Myrs. Unlinked to functional source gene ASS (9q34). | Nomiyama et al. (1986); Freytag et al. (1984)
Immunoglobulin J | Chromosome 8q13-q21. Unlinked to functional source gene IGJ (4q21). Originated ~40-50 Myrs ago. | Max et al. (1994)
Immunoglobulin ε | Present in Old World monkeys and gorilla but appears to have been lost in chimpanzee. | Ueda et al. (1985)
Immunoglobulin λ | Unlinked to functional source immunoglobulin λ genes IGLC1 (22q11). | Hollis et al. (1982)
Calmodulin (5) | Unlinked to functional source gene CALM1 (14q24-q31). Originated ~21-49 Myrs ago. | Koller et al. (1991); Rhyner et al. (1994)
Phosphoglycerate kinase | | McCarrey (1990)
Metallothionein | | Karin and Richards (1982)
β-Actin (3) | Chromosomes 6, 15 and 18. Unlinked to functional source gene ACTB (7p22) | Moos and Gallwitz (1982; 1983); Ueyama et al. (1996)
γ-Actin (3) | Chromosomes 3q23, 20p13, 6p21. Unlinked to functional source gene ACTG1 (17q25) | Ueyama et al. (1996)
β-Tubulin | 8q21-8pter. Unlinked to functional source gene TUBB (6p21-pter). | Wilde et al. (1982a; 1982b); Floyd-Smith et al. (1986)
Laminin receptor | Flanked by 18 bp direct repeats. Not transcribed. Originated 3.5-5 Myrs ago. | Richardson et al. (1998)
Enolase 1 | Chromosome 1q41-q42. Unlinked to functional source gene ENO1 (1p36-pter) | Ribaudo et al. (1996)
Uracil-DNA glycosylase (2) | Located on chromosomes 14 and 16. Unlinked to functional source gene UNG (12q23-q24) | Lund et al. (1996)
Tropomyosin, non-muscle | | MacLeod and Talbot (1983)
Zinc finger protein ZNF75 | 12q13. Unlinked to functional source gene ZNF75 (Xq26) | Villa et al. (1996)
Adenylate kinase 3 | Located within intron 10 of NF1 gene at 17q11.2 | Xu et al. (1992)
Nucleotide excision repair protein RAD52 | Chromosome 2. Unlinked to functional source gene RAD52 (12p12-q13) | Johnson and Campbell (1996)
Serotonin 5-hydroxytryptamine 1D receptor | Transcribed. Capable of being translated into 28 amino acid polypeptide. Unlinked to functional source gene HTR1D (1p34-p36) | Nguyen et al. (1993); Bard et al. (1995)
Fatty acid-binding protein 3 | 13q13-q14. Unlinked to functional source gene FABP3 (1p32-p33). | Prinsen et al. (1997)
Cysteine and glycine-rich protein 2 | 3q21. Unlinked to functional source gene CSRP2 (12q21). | Weiskirchen et al. (1997)
c-Ki-ras1 proto-oncogene | 6p12-p11. Unlinked to functional source gene KRAS2 (12p12). | McGrath et al. (1983)
c-elk-1 proto-oncogene (2) | 14q32 within IGHV locus. Unlinked to functional source gene ELK1 (Xp11.2). Insertion occurred 30-60 Myrs ago prior to IGHV locus duplication. | Harindranath et al. (1998)
Interleukin-6 signal transducer | 17p11. Unlinked to functional source gene IL6ST (5q11). | Rodriguez et al. (1995)
SHC transforming protein | Xq12-q13. Unlinked to functional source gene SHC1 (1q21). | Harun et al. (1997)
Mitochondrial-3-glycerol phosphate dehydrogenase 2 | Chromosome 17. Unlinked to functional source gene GPD2 (2q24) | Brown et al. (1996)
Mitochondrial elongation factor Tu | 17q11. Unlinked to functional source gene TUFM (16p11) | Ling et al. (1997)
Mitochondrial NADH:ubiquinone oxidoreductase 24-kDa subunit | 19q13.3-qter. Unlinked to functional source gene NDUFV2 (18p11.2-p11.31). Partially processed. | de Coo et al. (1995)
NADH:ubiquinone oxidoreductase B13 subunit | 11p15. Unlinked to functional source gene NDUFA5 (7q32) | Russell et al. (1997)
Solute carrier family 9 (NHE3) | Chromosome 10. Unlinked to functional source gene SLC9A3 (5p15) | Kokke et al. (1996)
B-raf | Xq13. Unlinked to functional source gene BRAF (7q34) | Eychene et al. (1992)
GTP-binding protein Gq | 2q14-q21. Unlinked to functional source gene GNAQ (9q21) | Dong et al. (1995)
αE-catenin | 5q22. Unlinked to functional source gene CTNNA1 (5q11) | Nollet et al. (1995)
Acyl-CoA binding protein | Chromosome 6. Unlinked to functional source gene DBI (2q12-q21). | Gersuk et al. (1995)
Methylthioadenosine phosphorylase | 3q28. Unlinked to functional source gene MTAP (9p21). | Tran et al. (1997)
Serine hydroxymethyltransferase | 1p32-33. Unlinked to functional source gene SHMT1 (17p11-p12). | Byrne et al. (1996); Devor and Dill-Devor (1997)
Hexokinase II | X chromosome. Unlinked to functional source gene HK2 (2p12). Originated 14-16 Myrs ago. Integrated into LINE element. | Ardehali et al. (1995)
Dihydrolipoamide succinyltransferase | 1p31. Unlinked to functional source gene DLST (14q24) | Nakano et al. (1994)
S-adenosylmethionine decarboxylase | Xq28. Unlinked to functional source gene AMD1 (6q21-q22). | Maric et al. (1995)
Dual specificity phosphatase 8 | 10q11.2. Unlinked to functional source gene DUSP8 (11p15). | Nesbit et al. (1997)
G protein-coupled receptor kinase GRK6 | Chromosome 13. Unlinked to functional source gene GPRK6 (5q35) | Gagnon and Benovic (1997)
Estrogen-related receptor ERRα | 13q12. Unlinked to functional source gene ESRRA (11q12-q13). | Sladek et al. (1997)
Phosphatidylinositol glycan class F | Chromosome 5q35. Unlinked to functional source gene PIGF (2p16-p21) | Ohishi et al. (1995)
Nucleophosmin (7) | 4 full-length, 3 truncated. Unlinked to functional source gene NPM1 (5q35) | Liu and Chan (1993)
RNA helicase, DDX10 | 9q21-q22. Unlinked to functional source gene DDX10 (11q22-q23) | Savitsky et al. (1996)
Transcriptional elongation factor TFIIS | Unlinked to functional source gene TCEA1 (3p21-p22) | Park et al. (1994)
U1 snRNA | | Dennison et al. (1981); Manser and Gesteland (1981)
U2 snRNA | | Van Arsdell and Weiner (1984)
U3 snRNA | | Bernstein et al. (1983)
U4 snRNA | | Hammarström et al. (1982)
U6 snRNA | | Hayashi (1981)
U13 snRNA | | Baserga et al. (1991)


Numbers in brackets denote numbers of pseudogenes in cases where more than one has been found. 


Some genes possess only a single processed pseudogene copy whereas others 
(e.g. the U1 and U6 small nuclear RNA (snRNA) genes) can possess hundreds. In 
some cases, the processed pseudogenes greatly outnumber their functional coun- 
terparts. One example is that of the human ribosomal protein multigene family 
(Chapter 5, section 5.1.14) whose members are composed predominantly of mul- 
tiple processed pseudogenes (Davies et al., 1989). Other examples are given in 
Table 6.2. The proportion of processed pseudogenes to functionally active genes in 
any one gene family may reflect the level of transcription of the source gene in the 
germline. It is also likely to be influenced by the private sequence characteristics 
of the source gene which will determine both the efficiency of priming of reverse 
transcriptase and that of the eventual integration of reverse transcripts. Thus, for 
example, the plethora of U3 snRNA pseudogenes may be explicable in terms of 
the inherent ability of U3 snRNA to act as a self-priming template for reverse 
transcriptase (Bernstein et al., 1983). Both human and murine cells possess an 
endogenous reverse transcriptase activity (Maestre et al., 1995; Tchenio et al., 
1993) although it is unclear whether this is derived from retroviral infection 
(Carlton et al., 1995), endogenous sources of reverse transcriptase such as retrovi- 
ral elements (Chapter 1, section 1.4.4) or, perhaps more likely, LINE elements 
(Dhellin et al., 1997; Jurka, 1997; Chapter 1, section 1.4.3). 

Some human processed pseudogenes have been inserted into the introns of 
other genes, for example a ribosomal protein L7 pseudogene in intron 1 of the 
c-fms (CSF1R; 5q33) proto-oncogene (Sapi et al., 1994), a phosphoglycerate 
mutase pseudogene in intron 1 of the Menkes disease (ATP7A; Xq13.3) gene 
(Dierick et al., 1997), an L5 ribosomal protein pseudogene in an intron of the 
small nuclear ribonucleoprotein N (SNRPN; 15q12) gene (Buiting et al., 1996), an 
L21 ribosomal protein pseudogene in intron 13 of the breast cancer (BRCA1; 
17q21) gene (Smith et al., 1996) and an adenylate kinase 3 pseudogene in intron 10 
of the neurofibromatosis type 1 (NF1; 17q11.2) gene (Xu et al., 1992). If integrated 
upstream of a gene, the pseudogene sequence may affect expression of that gene. 
For instance, a processed ribosomal protein S25 pseudogene, which has become 
integrated into the promoter region of a rat class I alcohol dehydrogenase gene, 
inactivated a glucocorticoid response element and contributed a novel suppressor 
element (Cortese et al., 1994). Similarly, the human L38 ribosomal protein 
processed pseudogene has become integrated into the promoter of the type-1 
angiotensin II receptor (AGTR1; 3q21-q25) gene although the effect, if any, on 
the expression of the gene is unclear (Espinosa et al., 1997). In general, the sites 
into which processed pseudogenes integrate appear to be AT-rich and various 
models have been proposed which attempt to describe the process of integration 
(Vanin, 1984; Wilde, 1985). Co-retrotransposition sometimes occurs with other 
sequences, e.g. endogenous retroviral elements (Lyn et al., 1995), LINE elements 
(Rozmahel et al., 1997) and possibly Alu sequences (Koller et al., 1991). 

Some pseudogenes may be duplicated copies of pre-existing processed pseudo- 
genes, for example the tandemly arranged proliferating cell nuclear antigen 
pseudogenes on chromosome 4q34 (Taniguchi et al., 1996), the L7a ribosomal 
protein pseudogenes associated with the closely linked lymphotactin α (LTNA) and 
β (LTNB) genes on chromosome 1q23 (Yoshida et al., 1996) or the elk-1 
pseudogene associated with the immunoglobulin heavy chain locus (IGHV) on 14q32 
(Harindranath et al., 1998). 

A highly unusual presence/absence polymorphism has been noted for a human 
dihydrofolate reductase processed pseudogene (Anagnou et al., 1984). This 
pseudogene is present in only a proportion of individuals and this proportion 
varies between racial groups. However, it is unclear whether the polymorphism is 
a consequence of the recent acquisition of the pseudogene or whether it instead 
results from its occasional loss. 

6.1.3 The generation of pseudogenes by other means 


Although the vast majority of known pseudogenes are either inactivated genomic 
copies (partial or complete) of functional genes or processed pseudogenes, there 
are a few unusual examples which do not fit easily into either of these two 
groupings. One such case is that of the hybrid β-δ-globin pseudogene found in lemurs 
(Jeffreys et al., 1982). The first and second exons plus intron 1 appear to be derived 
from the ψβ-globin pseudogene whilst intron 2, exon 3, and the 3’ UTR 
originated from the δ-globin gene. The hybrid pseudogene probably originated from 
an unequal crossover which fused sequences derived from the ψβ-globin 
pseudogene and the δ-globin gene. 

The cadherin pseudogene on human chromosome 5q13 described by Selig et al. 
(1995) is also unusual. It contains one exon derived from the source gene encod- 
ing cadherin 12 (CDH12; 5p13-p14) plus flanking introns, an open reading frame 
of 794 amino acids, and is transcribed at a 10-fold higher rate than the source 
gene. Selig et al. (1995) proposed that this pseudogene had been only partially 
processed in so far as the intron sequences had not been completely removed from 
the mRNA before reverse transcription and integration had taken place. Other 
examples of semi-processed human pseudogenes include one at 1q24-q25 derived 
from the MADS box transcription factor 2 (MEF2A; 15q26) gene (Suzuki et al., 
1996), another at 19q13.3-qter derived from the mitochondrial NADH: 
ubiquinone oxidoreductase (NDUFV2; 18p11.2-p11.31) gene (de Coo et al., 1995), 
and at least 13 pseudogenes which have independently originated from an 
identical mis-spliced transcript of the protein geranylgeranyltransferase type Iβ 
(PGGT1B) gene (Dhawan et al., 1998). 


6.1.4 Origin and age of pseudogenes 


Some indication of the relative age of a pseudogene can be obtained by the extent 
of homology between the pseudogene and its parent or source gene. Thus, in the 
human α-globin gene cluster on chromosome 16, the ψζ-globin pseudogene is 
>99.5% homologous to its parental ζ-globin (HBZ; 16p13.3-pter) gene whereas 
the rather older ψα-globin pseudogene is only 75-80% homologous to its parental 
α-globin (HBA1; 16p13.3-pter) gene. 
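
The percentage figures quoted above are pairwise identities between aligned sequences. As a minimal sketch of how such a figure is computed, assuming the two sequences have already been aligned to equal length with '-' marking gaps (the function name and the decision to skip gapped columns are illustrative assumptions, not the procedure used in the cited work):

```python
def percent_identity(aligned_a, aligned_b, gap="-"):
    """Percentage of identical bases over ungapped alignment columns."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    compared = 0
    matches = 0
    for x, y in zip(aligned_a.upper(), aligned_b.upper()):
        if x == gap or y == gap:
            continue  # columns containing a gap are ignored in this sketch
        compared += 1
        if x == y:
            matches += 1
    if compared == 0:
        raise ValueError("no ungapped columns to compare")
    return 100.0 * matches / compared


if __name__ == "__main__":
    # Hypothetical 10-base alignment with a single mismatch.
    print(percent_identity("ACGTACGTAC", "ACGTACGAAC"))
```

Whether gapped columns are counted changes the result, so identity values from different studies are only comparable when the same convention has been used.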

The approximate age of a pseudogene can be estimated by determining whether 
or not an orthologous sequence is present at an identical location in other related 
species of known phylogeny. Thus, the XG blood group pseudogene is present in 
the great apes but not in Old World monkeys, New World monkeys or prosimians 
(Weller et al., 1995) whilst multiple copies of the keratinocyte growth factor 
pseudogene are present in chimpanzee and gorilla but not in gibbons or Old World 
monkeys (Kelley et al., 1992). Similarly, the ψβ1-globin pseudogene has been 
found in prosimians and New World monkeys as well as the anthropoid apes and 
human and is therefore thought to have originated very early in primate evolution 
(Harris et al., 1984). Rouquier et al. (1998) noted distinct human- and 
gorilla-specific inactivating mutations in the olfactory receptor pseudogene (termed 
912-93) that is located on human chromosome 11q11-q12 and which exhibits 
synteny in the hominoid primates. Finally, the serine 
hydroxymethyltransferase processed pseudogene, derived from the SHMT1 gene 
on chromosome 17p11.2, arose during the evolution of the primates but different 
lesions have occurred in different phylogenetic branches of the family making 
this pseudogene a potentially useful marker in molecular studies (Devor et al., 
1998). 

Attempts have also been made to estimate the age of pseudogenes (time since 
duplication and/or inactivation) by comparing the proportion of silent and 
replacement changes in the pseudogene sequence to that exhibited by the active 
parent gene. Such estimates assume that following inactivation, pseudogenes will 
accumulate mutations at the same rate as silent changes in active genes. However, 
what is often ignored is that there is still some selection against silent changes in 
functional genes. Moreover, such estimates are extremely sensitive to the con- 
founding effects of gene conversion (see Section 6.1.6 and Chapter 9, section 9.6) 
whose homogenizing influence serves continually to restart the molecular clock 
leading to the underestimation of the age of the pseudogene. 
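
The logic of such clock-based estimates can be sketched as follows. This is an illustration under stated assumptions rather than the method of any study cited here: it uses the Jukes-Cantor correction for multiple hits, assumes a constant neutral substitution rate that the user must supply, and assumes that both compared sequences have evolved at that rate since they diverged. The function names and the example rate are hypothetical.

```python
import math


def jukes_cantor_distance(p):
    """Substitutions per site estimated from the observed proportion p
    of differing sites, using the Jukes-Cantor correction."""
    if not 0 <= p < 0.75:
        raise ValueError("p must lie in [0, 0.75) for the correction")
    return -0.75 * math.log(1 - 4 * p / 3)


def age_in_myr(p, rate_per_site_per_myr, lineages=2):
    """Divergence time in Myrs, assuming every lineage accumulates
    substitutions at the same constant neutral rate.

    lineages=2 treats both sequences as evolving neutrally since they
    split; lineages=1 treats one sequence as an unchanged reference.
    """
    return jukes_cantor_distance(p) / (lineages * rate_per_site_per_myr)


if __name__ == "__main__":
    # Hypothetical input: 10% observed divergence and an assumed rate
    # of 0.002 substitutions per site per Myr.
    print(age_in_myr(0.10, 0.002))
```

Because gene conversion lowers the observed divergence p, it lowers the estimated age, which is the underestimation referred to above.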


6.1.5 Patterns of mutation in pseudogenes 


Pseudogenes represent extremely useful tools for the study of mutation because, 
since they lack a biological function and are not subject to selective constraints, 
all mutations that occur in pseudogenes are selectively neutral and will become 
fixed in the population with equal probability. 
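
The reason neutrality makes pseudogenes useful in this context can be written out in one line. The symbols below are not used in the text and are introduced here as assumptions: N is the size of a diploid population, \mu the mutation rate per site per generation, and k the rate at which substitutions become fixed. This is the standard neutral-theory result, given as a sketch rather than a derivation.

```latex
% New mutations arising per site per generation: 2N\mu
% Fixation probability of each new neutral mutation: 1/(2N)
k = 2N\mu \times \frac{1}{2N} = \mu
```

On these assumptions the substitution rate observed in a pseudogene estimates the underlying mutation rate directly, independently of population size.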

Pseudogenes exhibit a very high rate of nucleotide substitution with the CpG 
dinucleotide being a hotspot for mutation (Bulmer et al., 1986; Gojobori et al., 
1982; Li et al., 1981; Li et al., 1984; Miyata and Hayashida, 1981). By contrast, 
deletions occur about once every 40 nucleotide substitutions whilst insertions 
occur about once every 100 nucleotide substitutions (Ophir and Graur, 1997). The 
age of the pseudogene, however, is not always linearly related to the numbers of 
deletions and insertions present (Ophir and Graur, 1997), and this may be due at 
least in part to gene conversion (see Section 6.1.6). 
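
The ratios quoted above translate into simple expected counts. The sketch below takes only the two ratios (about one deletion per 40 substitutions and one insertion per 100) from the text; the function name is hypothetical, and the assumption of strict proportionality is a simplification which, as just noted, does not always hold.

```python
def expected_indels(substitutions, subs_per_deletion=40, subs_per_insertion=100):
    """Expected numbers of deletions and insertions for a given number
    of nucleotide substitutions, assuming fixed ratios."""
    if substitutions < 0:
        raise ValueError("substitutions must be non-negative")
    deletions = substitutions / subs_per_deletion
    insertions = substitutions / subs_per_insertion
    return deletions, insertions


if __name__ == "__main__":
    # Hypothetical pseudogene that has accumulated 200 substitutions.
    deletions, insertions = expected_indels(200)
    print(deletions, insertions)
```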

The rate of DNA loss through deletion from processed pseudogenes appears to 
be considerably higher in rodents than in humans (Graur et al., 1989; Ophir and 
Graur 1997). Interpretation of such interspecies comparisons can however be con- 
founded by possible differences in the frequency of gene conversion between 
species. 

The mutation rate within processed pseudogenes appears to be higher than in 
regions flanking the site of insertion (Casane et al., 1997). This is potentially 
explicable if one considers that those sites which are inherently the most mutable 
have been maintained by selection within the retrotransposed sequence up until 
pseudogene formation whereas such sites have been removed in the flanking 
regions which have been unconstrained by the effects of natural selection for 
much longer periods of evolutionary time. 


6.1.6 Pseudogenes and gene conversion 


Gene conversion is the modification of one of two alleles by the other. It involves 
the nonreciprocal correction of an ‘acceptor’ gene or DNA sequence by a ‘donor’ 
sequence which remains physically unchanged. In most known cases, gene 
conversion has occurred between highly homologous and closely linked gene 
sequences (Cooper and Krawczak, 1993). Examples of gene conversion that have 
occurred during primate evolution are discussed in Chapter 9, section 9.5. Gene 
conversion has also served as a mutational mechanism causing human inherited 
disease; probable examples involve the genes and pseudogenes for steroid 21- 
hydroxylase (CYP21; 6p21.3; Tusié-Luna and White, 1995), polycystic kidney 
disease (PKD1; 16p13), neutrophil cytosolic factor p47-phox (NCF1; 7q11.2; 
Gorlach et al., 1997), immunoglobulin λ-like polypeptide 1 (IGLL1; 22q11; 
Minegishi et al., 1998), glucocerebrosidase (GBA; 1q21; Eyal et al., 1990), von 
Willebrand factor (VWF; 12p13; Eikenboom et al., 1994), and phosphomanno- 
mutase (PMM2; 16p13; Schollen et al., 1998). These gene/pseudogene pairs are all 
closely linked with the exception of the VWF gene (12p13) and its pseudogene 
(22q11-q13), and the PMM2 gene (16p13) and its pseudogene (18p). Together, 
these two exceptions would seem to establish a precedent for the occurrence of 
gene conversion between unlinked loci in the human genome. 

Could pseudogenes (both the highly dispersed processed variety and/or the 
closely linked duplication-derived pseudogenes) also have templated advanta- 
geous changes in their single copy functional source genes over evolutionary 
time? Certainly the converse is true since sequence changes in pseudogenes can be 
templated by their functional homologues (DeBry, 1998). If pseudogene-tem- 
plated changes in functional genes were not deleterious, they could eventually 
have become fixed in populations or even species. Thus, in principle, pseudo- 
genes, whether processed or non-processed, could act as a reservoir of sequence 
variation which could at some stage be transferred back to the functional gene. In 
this way, different mutational combinations might be put together within the 
pseudogene, all the time being shielded from selective pressure, and then func- 
tionally tested after simultaneous transfer to the expressed gene copy. For any one 
gene, the contribution of pseudogene-mediated gene conversion to the process of 
evolutionary change might be expected to depend on: 


(i) The number of pseudogenes (processed or unprocessed) in the genome that 
are homologous to the gene in question. 

As we have seen in Section 6.1.2, this can vary by at least two orders of mag- 
nitude. 

(ii) Whether the pseudogenes are linked or unlinked to the functional gene. 

Gene conversion does occur between unlinked homologous sequences 
(Fitzgerald et al., 1996; Murti et al., 1994). Gene conversion between multi- 
copy, dispersed retrotransposons such as Alu sequences (Kass et al., 1995) and 
LINE elements (Burton et al., 1991) has also been reported. However, inter- 
chromosomal gene conversion events are predicted to be much less frequent 
than intra-chromosomal events (Liao et al., 1997). 

(iii) The length of sequence involved in the gene conversion event. Were this to be too 
large, inactivating mutations present in the pseudogene would be more likely 
to be transferred to the functional gene thereby inactivating it rather than 
altering it. In practice, however, gene conversion events are often quite local- 
ized (Kim et al., 1993; Pamilo and Bianchi, 1993; Zhou and Lee, 1996), one 
example being the HLA-DQB1 gene (6p21.3) in which gene conversion 
events have been confined to exon 2 without extending into the adjacent 
introns (Bergstrom et al., 1998). Some gene conversion events can 
nevertheless involve rather longer stretches of DNA, for example over the entire 
rhesus monkey γ1 and γ2 globin genes (Slightom et al., 1988) and over the entire 
hominoid immunoglobulin Cα genes (Kawamura et al., 1992). The 
interpretational problem in such studies lies in the fact that it is difficult to 
distinguish a single large gene conversion event from multiple overlapping short 
gene conversions. 

(iv) The age of the pseudogenes. The degree of sequence homology to the functional 
gene will decay with time and this could serve to reduce the frequency of gene 
conversion as well as the length of sequence involved. Further, the number of 
inactivating mutations acquired by the pseudogene will increase with time 
and this would have the effect of reducing the likelihood of a functionally 
productive gene conversion event. 


At the present time, definitive evidence for pseudogene-mediated gene conver- 
sion is fairly sparse. Since by its very nature, gene conversion tends to cover its 
own traces, such evidence may prove hard to obtain. Further, although the donor 
sequence might be a pseudogene in an extant genome, this does not mean that it 
was necessarily a pseudogene at the time of the gene conversion event. The 
demonstration that the sequence in question has been accumulating mutations 
over a considerable period of time would provide evidence in favor of an ancient 
inactivating event. Again, however, gene conversion could confound estimates of 
the length of evolutionary time elapsed. The above notwithstanding, one possible 
example of pseudogene-mediated gene conversion involves a human 
immunoglobulin VH pseudogene (V4-55P) which may have served as a donor 
sequence in the conversion of two functional VH (IGHV; 14q32) genes, V4-4b and 
V4-28 (Haino et al., 1994). Similarly, a pseudogene in the human growth 
hormone gene cluster may have templated sequence changes in a functional 
source gene: the growth hormone 1 (GH1; 17q22-q24) gene promoter region 
contains five single base polymorphic alleles (at -57, -1, +3, +16 and +26) which are 
identical to the bases present at the homologous locations in the promoter of the 
closely linked and evolutionarily related pseudogene (CSHL1) but are different to 
those present in the more distal CSH1, GH2 and CSH2 genes (Giordano et al., 
1997). A gorilla HLA-A gene has been shown to be similar to an HLA-AR pseudo- 
gene only in exon 2 with the remainder of the gene being closely related to other 
primate A locus genes (Watkins et al., 1991). Finally, immunoglobulin V gene 
diversity in chickens is known to be increased by gene conversion during B-cell 
development; the germline pool of donor sequence information for somatic gene 
conversion is found in the families of V pseudogenes located 5’ to the single func- 
tional V gene at each locus (McCormack et al., 1993). 


6.1.7 Pseudogene reactivation 


Pseudogenes may be by definition inactive but this does not mean that they can 
never regain their activity. Indeed, Marshall et al. (1994) calculated that the resur- 
rection of pseudogenes is probabilistically feasible within about 6 Myrs of forma- 
tion but that it is unlikely after more than 10 Myrs have elapsed owing to the 
accumulation of multiple inactivating mutations. 

One putative example of the natural resurrection of a pseudogene is that of a 
bovine seminal ribonuclease gene which appears to have become reactivated after 
the divergence of the kudu but before the divergence of the ox, between ~5 Myrs 
and 10 Myrs ago (Trabesinger-Ruef et al., 1996). Since the seminal ribonuclease 
genes are highly homologous to the pancreatic ribonuclease gene family (the sem- 
inal proteins emerged about 35 Myrs ago by a process of duplication and diver- 
gence from a pancreatic protein gene), the repair of the seminal ribonuclease gene 
may have been effected by gene conversion, involving the transfer of genetic 
information from a pancreatic ribonuclease gene. 

An example of the natural reactivation of a pseudogene in humans is that of a 
truncated γE-crystallin gene (CRYGEP1) found in association with Coppock-like 
cataract (Brakenhoff et al., 1994). As we have seen in Chapter 4, section 4.2.1, 
Crystallin genes, the γ-crystallin gene cluster at 2q33-q35 comprises four genes and 
two pseudogenes, γE and γF, both of which contain in-frame stop codons. In the 
above-mentioned patient, a number of sequence changes occurred within and 
around the TATA box in the promoter region of the γE pseudogene which appear 
to have resulted in a 10-fold increase in promoter activity. The predicted product 
of the γE pseudogene is a 6 kDa N-terminal γ-crystallin fragment which could be 
responsible for the increased opacity of the lens in the patient. 

In an evolutionary context, the reactivation of the human ψζ1-globin 
pseudogene (HBZP; 16p13) by removal of its sole inactivating mutation appears to have 
occurred by gene conversion (Hill et al., 1985). This change, which has probably 
been templated by the neighboring ζ-globin (HBZ; 16p13) gene, is a common 
polymorphism in a number of different populations (Hill et al., 1985). 

The above examples of the reactivation of pseudogenes were naturally occur- 
ring. Reactivation of an inactivated pseudogene can however also be effected arti- 
ficially by correction of the inactivating mutation(s) and restoration of the 
sequence originally present prior to inactivation. One example of this is the 
artificial reactivation by in vitro mutagenesis of the human T-cell receptor γ variable 
V10 pseudogene which was originally inactivated by a single base-pair 
substitution in a donor splice site (Zhang et al., 1996). Such experimental studies serve to 
demonstrate that the corrected lesion was the sole impediment to expression and 
therefore likely to be the original inactivating mutation. 


6.2 Gene loss/inactivation in primates 


Genes can be created but they may also be destroyed. As we have seen, one form 
of gene inactivation involves pseudogene formation. This process involves the 
creation of a copy of a specific gene (whether cDNA or genomic) followed by its 
inactivation by the sudden acquisition, and subsequent steady accumulation of, 
deleterious mutations. Owing to the redundancy of genetic information available 
upon gene duplication, there is often no selective pressure to retain the integrity 
of the new sequence because the sequence from which it was derived still exists 
and functions normally. 

By contrast, the loss or inactivation of a single copy essential gene is likely to 
have a dramatic phenotypic effect. Many of the lesions responsible will be evident 
by their clinical manifestations (assuming that they are not embryonic lethal) and 
we should expect them to be rapidly removed from the population by natural 
selection. On the other hand, the inactivation of a nonessential gene could in 
principle have relatively few deleterious consequences and might even have a 
beneficial effect under the appropriate circumstances. One possible example of 
this is the C4b-binding protein β-chain (C4BPB; 1q32) gene which occurs 
as a single copy functional gene in the human genome but has been inactivated in 
the mouse (Rodriguez de Cordoba et al., 1994). That the loss of this gene has been 
fixed evolutionarily in the murine lineage suggests that it is at least neutral with 
respect to fitness. Its loss may even have conferred a selective advantage 
(antithrombotic effect?) in which case, it could have become fixed within a rela- 
tively short period of evolutionary time. Several similar examples of the inactiva- 
tion of single copy genes from primate genomes have also been documented and 
these single copy pseudogenes will now be described in some detail. 


6.2.1 Urate oxidase gene 


Urate oxidase is a copper binding enzyme found in most vertebrates which catal- 
yses the conversion of uric acid to allantoin. Although found in Old World mon- 
keys and in the majority of New World monkeys, urate oxidase activity has been 
lost in the hominoid line. The human urate oxidase (UOX; 1p22) ‘gene’ was iso- 
lated by screening a genomic library with a porcine cDNA probe (Wu et al., 1989). 
It was found to contain two nonsense (CGA->TGA) mutations at codons 33 and 
187 (Wu et al., 1989; Yeldandi et al., 1990). These lesions were also found to be pre- 
sent in chimpanzee and gorilla but only the codon 33 mutation was detected in 
orangutan (Yeldandi et al., 1991). This implies that the codon 33 mutation was the 
original inactivating mutation and that it must have occurred before the diver- 
gence of orangutan from the chimpanzee/human line, between 7 Myrs and 13 
Myrs ago. Urate oxidase activity is also absent in the gibbon but this has been 
shown to be due to a 13 bp deletion in exon 2 of the Uox gene (Wu et al., 1992). The 
loss of urate oxidase activity in primates has therefore been due to the occurrence 
of at least two distinct gene lesions in different lineages. That this gene could have 
been lost in primates as a result of at least two independent mutational events is 
consistent with its loss being advantageous. Although somewhat far-fetched per- 
haps, it has been suggested that since uric acid is a potent antioxidant, an 
increased uric acid concentration might have contributed to a lowering of the 
somatic mutation rate with a consequent increase in hominoid lifespan. The loss 
of urate oxidase may however be responsible for at least some cases of renal stones 
and gouty arthritis in humans. 


6.2.2 α-1,3-Galactosyltransferase gene 


α-1,3-Galactosyltransferase (α-1,3 GT) is responsible for the synthesis of the 
α-galactosyl epitope present in the cell surface receptors of most mammals including 
prosimians and New World monkeys. However, the catarrhines (Old World 
monkeys, apes and humans) appear to lack this epitope and produce large 
amounts of antibodies against it. Two inactivating single base-pair deletions (del 
C822 and del G904) were found in the human α-1,3 GT (GGTA1) ‘gene’, located 
at chromosome 9q34 (Larsen et al., 1990). Since chimpanzees possess both deletions 
whereas orangutan and gorilla only have del G904, it was reasoned that del G904 
was the original inactivating mutation (Galili and Swanson, 1991). Old World 
monkeys were found not to possess either lesion and no other inactivating mutation 
was obvious. Clearly, as with urate oxidase, at least two distinct inactivating 
mutations have occurred in apes and Old World monkeys. Selective pressure to 
suppress α-galactosyltransferase expression may have been mediated by a 
pathogen which infected primates via cell surface receptors containing the 
α-galactosyl epitope. Alternatively, the pathogen might have expressed α-galactosyl 
epitopes and an effective host immune response would have required the 
suppression of autologous α-galactosyl epitope synthesis. 


6.2.3 Elastase I gene 


The human pancreatic elastase I (ELA1; 12q13) ‘gene’ is transcriptionally silent 
even though the coding sequence and splice junctions appear to be intact. Studies 
of the corresponding (transcriptionally active) rat gene have demonstrated that 
some 205 bp immediately upstream of the transcriptional start site are required 
for pancreatic expression. This regulatory region comprises an enhancer between 
-205 and -72 that contains all the information necessary for tissue-specific 
expression, and a minimal promoter element between -71 and +8. The enhancer 
contains three functional elements (A, B, and C), two of which (A and B) bind 
pancreas-specific transcription factors (PTF1 and PDX1) that mediate the cell 
type specificity of the enhancer. The 5’ flanking region of the human ELA1 ‘gene’ 
was studied by Rose and MacDonald (1997). Its nucleotide sequence differs by 
28% from its rat counterpart whilst its enhancer/promoter strength (as measured 
by reporter gene assays) was 2.6% that of the homologous rat sequence. 
Comparison of the rat and human gene 5’ flanking regions revealed a total of 13 
nucleotide differences within the A, B, and C elements. A combination of the 
three human enhancer elements, ‘repaired’ by reference to the rat sequence, 
together with the rat promoter, partially restored the activity of the human 
enhancer (Rose and MacDonald, 1997). These authors went on to demonstrate 
that two mutations in the A element and four mutations in the B element served 
to abolish the binding of the cognate transcription factors. The degree to which 
these lesions individually exert a deleterious effect on enhancer function will not 
be apparent without further functional studies. In addition, without phylogenetic 
studies of the ELA1 genes from other mammals/primates, it is unclear in which 
order these mutations occurred during evolution. This notwithstanding, we may 
still conclude that the silencing of the human ELA1 ‘gene’ has come about 
through the mutational inactivation of the upstream enhancer rather than 
through the alteration of the coding region. 


6.2.4 L-gulono-γ-lactone oxidase gene 


Primates (including humans) and guinea pigs have at least one thing in common: 
they are, unlike other mammals, unable to synthesize L-ascorbic acid from 
D-glucose owing to a deficiency of the enzyme L-gulono-γ-lactone oxidase. Using 
a rat cDNA as a probe, Nishikimi et al. (1994) isolated the once active human 
L-gulono-γ-lactone oxidase (GULOP) ‘gene’ which is located at chromosome 
8p21.1. A considerable number of inactivating mutations were found including 
the deletion of exon VIII, two 1 bp deletions, one 1 bp insertion and two nonsense 
mutations. The presence of two non-consensus dinucleotides at donor splice sites 
(GC and GG instead of GT) also suggested the introduction of potentially inacti- 
vating mutations at splice sites. Since both Old World and New World monkeys 
are deficient in L-gulono-γ-lactone oxidase whilst prosimians possess this 
enzyme, its loss must have preceded the divergence of New World from Old 
World monkeys (~45 Myrs ago) but occurred after the divergence of the prosimi- 
ans from the simian line (50-65 Myrs ago). 


6.2.5 ADP-ribosyltransferase 1 gene 


ADP-ribosyltransferase 1 (RT6) is a glycosyl phosphatidylinositol-anchored cell 
membrane protein expressed in peripheral T cells and intra-epithelial lympho- 
cytes. In the rat, expression of RT6 has been used to distinguish subsets of T cells 
and a defect in RT6 expression has been associated with susceptibility to autoim- 
mune type I diabetes. Haag et al. (1994) have shown that the human (ART2P; 
11p15) and chimpanzee ‘genes’ contain three nonsense mutations, at codons 47, 
141, and 193 respectively, which serve to inactivate them. No further copies of the 
ART2P ‘gene’ were detectable in the human genome consistent with its presumed 
status of single copy pseudogene. It is possible that inactivation of the ART2P 
gene conferred a selective advantage if, for example, it resulted in the loss of a 
membrane receptor for a pathogenic virus. It is as yet unclear if loss of the RT6 
protein confers increased susceptibility to disease. 


6.2.6 Haptoglobin gene 


Haptoglobin, a hemoglobin-binding protein, is encoded by three closely linked 
genes in chimpanzee (HP, HPR, and HPP) but only two (HP, HPR; 16q22.1) in 
human. The loss of the HPP ‘gene’ in human occurred after the separation of the 
human and chimpanzee lineages by an unequal homologous crossover that 
deleted most of the HPP gene (McEvoy and Maeda, 1988). 


6.2.7 Fertilin-α gene 


In most mammals including macaques and baboons, α-fertilin contributes a subunit 
to a heterodimeric membrane glycoprotein on the sperm surface. The 
macaque and baboon possess two α-fertilin genes which appear to lack introns 
(indicative perhaps of a retrotranspositional origin). Only a single copy homo- 
logue (FTNA) is present in the human genome and, although this sequence is 
transcribed in the testis, it represents a nonfunctional pseudogene since it con- 
tains numerous mutations which render translation of the transcript impossible 
(Jury et al., 1997). What is puzzling is that although the 5’ half of the human 
FTNA ‘gene’ is highly homologous to the macaque Ftna1 gene, the 3’ half differs 
markedly from both Ftna1 and Ftna2 genes. Jury et al. (1997) attempted to explain 
the creation of a solitary nonfunctional human gene in terms of a recombination 
event between the two ancestral primate Ftna genes. 


6.2.8 T-cell receptor γ V10 variable region gene 


The ‘gene’ encoding the V10 variable region of the T-cell receptor γ (TCRG; 7p15) 
has a single base-pair substitution in a donor splice site which serves to inactivate 
the gene in humans (Zhang et al., 1996). The counterparts of this gene in chim- 
panzee and gorilla contain functional splice sites and therefore this lesion must 
have occurred in the human lineage. Restoration of the defective splice site by in 
vitro mutagenesis has been shown to generate a correctly spliced product. 


6.2.9 Cytidine monophospho-N-acetylneuraminic acid hydroxylase gene 


Humans differ from the great apes in that they lack the enzyme CMP-N-acetyl- 
neuraminic acid hydroxylase which, as the name suggests, hydroxylates N-acetyl- 
neuraminic acid. This cell surface sialic acid molecule is thought to be involved 
in cell-cell recognition and in cell-pathogen interactions. The loss of enzyme 
activity in human is caused by a 92 bp deletion in the coding region of the CMP- 
N-acetylneuraminic acid hydroxylase (CMAH; 6p22-p23) gene (Chou et al., 1998; 
Irie et al., 1998). This deletion results in a frameshift leading to premature termi- 
nation of translation. It is specific to the human lineage since it is not found in 
any of the great apes (Chou et al., 1998) but it is not of recent origin since it has 
been found to be present in Caucasians, African Americans, Japanese, Kung bushmen 
and Khwe pygmies (Chou et al., 1998). Chou et al. (1998) speculated that the 
loss of N-acetylneuraminic acid hydroxylation may have had implications for sus- 
ceptibility to infectious disease since some pathogens utilize sialic acids as spe- 
cific binding sites on mammalian cells. 


6.2.10 Flavin-containing monooxygenase 2 gene 


The flavin-containing monooxygenase 2 gene is one of five FMO genes found in 
mammals which encode a series of NADPH-dependent flavoenzymes that are 
capable of catalyzing the oxidation of numerous drugs and xenobiotics. FMO2 is 
expressed predominantly in the lung where it constitutes the major form of FMO. 
The human FMO2 gene (FMO2; 1q) differs from those of other mammals in that 
it possesses a CAG->TAG transition converting Gln472 to a stop codon (Dolphin 
et al., 1998). This lesion predicts a truncated polypeptide lacking 64 amino acids 
from its C-terminus. The FMO2 mRNA transcript is both abundant and stable. 
In vitro expression studies have demonstrated that the truncated protein product 
is translated, correctly targeted to the endoplasmic reticulum but has lost its cat- 
alytic activity. This mutation is not present in the Fmo2 genes of gorilla and chim- 
panzee and must therefore have arisen in the human lineage in the last 5-7 Myrs. 
Interestingly, about 4% of individuals in African populations possess a CAG (Gln) 
triplet at codon 472. It is possible that this CAG allele represents a remnant of the 
original gene sequence in which case the lesion may have occurred during early 
human history. The alternative is that it is a reverse mutation brought about perhaps 
by gene conversion templated by a related FMO gene. The biochemical and 
toxicological significance of the loss of FMO2 activity in human lung is unclear 
and so it is premature to speculate as to whether its loss conferred a selective 
advantage upon bearers or whether the inactivating lesion could have become 
fixed by genetic drift alone. 


6.2.11 Overview 


Several genes are thus known to have become inactivated during primate/human 
evolution. Inactivation does not however always imply the deletional removal of 
the gene. Rather, the once functional gene sequences often remain in the human 
genome to be detected as the inactive orthologues of their still active counterparts 
in the genomes of lower vertebrates. A variety of subtle lesions are usually respon- 
sible for the inactivation e.g. micro-deletions or insertions, or single base-pair 
substitutions which create in-frame stop codons, alter the invariant bases at splice 
junctions or adversely affect the binding of transcription factors to the promoter 
region. The greater the time that has elapsed since the initial inactivation event, 
the greater the number of mutations that have accumulated, and therefore the 
harder it is to discern the identity of the initial inactivation event. We may only 
speculate as to the reasons for the loss of these genes. In some cases, gene loss may 
have been neutral with respect to fitness and the null allele would have become 
fixed through genetic drift. In the case of the flavin-containing monooxygenase 2 
(FMO2) gene, Dolphin et al. (1998) calculated, assuming a generation time of 15 
years and an effective population size of 10 000, that it would have taken some 
600 000 years for the Gln472Term mutation to approach fixation. In the case of 
other genes lost from the human or primate genomes, some selective advantage, 
for example resistance to pathogens, may have accrued to bearers of the inactivated 
genes, resulting in the more rapid spread of the mutant allele. 
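
The figure of 600 000 years quoted above is consistent with the standard neutral-theory approximation that a new neutral allele which is destined for fixation takes, on average, about 4Ne generations to become fixed (Ne being the effective population size). Whether Dolphin et al. (1998) used exactly this approximation is an assumption made here; the function name below is illustrative only. A minimal sketch of the arithmetic:

```python
def neutral_fixation_time_years(effective_population_size, generation_time_years):
    """Approximate mean time to fixation of a neutral allele, in years.

    Uses the diffusion approximation of ~4 * Ne generations, conditional
    on the allele eventually reaching fixation. This is an assumed model,
    not necessarily the exact calculation used in the cited study.
    """
    generations = 4 * effective_population_size
    return generations * generation_time_years


# Values quoted in the text: Ne = 10 000, generation time = 15 years.
print(neutral_fixation_time_years(10_000, 15))  # 600000
```

Under positive selection the expected time to fixation would be shorter than this neutral estimate, which is the contrast drawn in the paragraph above.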
PART 3 
MUTATIONAL MECHANISMS IN EVOLUTION 


Single base-pair substitutions 


7.1 Single base-pair substitutions in evolution 


7.1.1 The selectionist and neutralist perspectives 


Nowadays a certain number of believers in evolution do not regard natural 
selection as a cause of it. We must therefore carefully distinguish between two 
quite different doctrines which Darwin popularised, the doctrine of evolution, 
and that of natural selection. It is quite possible to hold the first and not the 
second. 

J.B.S. Haldane The Causes of Evolution (1932) 


The vast majority of mutations that occur are either neutral with respect to fitness 
(defined as the individual’s ability to survive and reproduce) or are disadvanta- 
geous. If they are disadvantageous, they will tend to be removed from the popula- 
tion since their bearers will be less likely to survive and/or reproduce (negative or 
purifying selection). Occasionally, a new mutation confers a selective advantage and 
increases the fitness of individuals bearing it so that it will eventually reach fixa- 
tion (the point at which the allele frequency in the population becomes 100%). 
This is termed positive selection. Selection is thus nothing more than the differen- 
tial and nonrandom reproduction of genotypes resulting from the superior or 
inferior fitness of their associated phenotypes. In vertebrates, direct evidence at 
the molecular level for the occurrence of selection, whether positive or negative, 
has however often been hard to obtain and there are as yet relatively few good 
practical examples (reviewed in Chapter 2, section 2.3.7 and Sections 7.1.2 and 
7.5.2). 

Gene evolution does not however invariably require selection since changes in 
allele frequency can also occur by chance owing to random sampling of gametes 
(genetic drift). Whereas selection implies directed change, genetic drift may be 
viewed as a stochastic process of undirected change. Genetic drift can cause rapid 
changes in small populations but its effect will be fairly minimal in large ones. This 
is one of the central conclusions of the neutral theory of molecular evolution (Kimura, 
1983). The importance of this theory to our understanding of the evolutionary 
process cannot be overstated. Its major points may be summarized as follows: 


Human Gene Evolution, David N. Cooper. 
© 1999 BIOS Scientific Publishers Ltd, Oxford. 297 


(i) Most polymorphic variation within a species and most amino acid substitutions 
between species are likely to be neutral with respect to selection. This 
is not to say that variant alleles have no biological function or phenotypic 
effect, merely that the alleles are indistinguishable (neutral) in terms of 
their function. 

(ii) The amino acid substitution rate can vary between genes and this is likely 
to be related to the nature and extent of structural and functional constraints 
upon the different protein products. 

(iii) Those parts of a protein with the fewest functional constraints will evolve 
most rapidly. Conversely, those regions of a protein which are functionally 
important will be those that are most highly conserved evolutionarily. 

(iv) Genes whose protein products have a stable function manifest a molecular 
clock i.e. the amino acid substitution rate is similar for a given gene in different 
evolutionary lineages. 

(v) A high level of genetic polymorphism will exist and this is predicted to 
increase with population size since the probability of fixation of a specific 
allele is lower in large populations. 

(vi) In coding sequences, nucleotides in the third codon position evolve faster 
than the first two positions. The proportion of substitutions in the third 
position should be higher for proteins that are evolving more slowly. 

(vii) Any change in the function of a gene sequence will serve to alter the proportion 
of amino acid substitutions that are neutral. Genes that are no 
longer capable of expression (pseudogenes) will accumulate mutations 
rapidly (Figure 7.1) and are destined eventually to lose recognizable homology 
to the original sequence. 

(viii) Sequences or portions of sequences that are evolving faster than the neutral 
rate for the species are likely to have been subject to positive selection. 

(ix) Genes manifesting a high rate of variation should also evolve at a higher 
rate. 
Figure 7.1. Average rates of substitution in different parts of genes and in pseudogenes 
(Li and Graur, 1991). 
7.1.2 Synonymous and nonsynonymous substitutions 


The preservation of favourable variations and the rejection of injurious varia- 
tions, I call Natural Selection, or Survival of the Fittest. Variations neither 
useful nor injurious would not be affected by natural selection and would be 
left a fluctuating element. 

Charles Darwin (1859) The Origin of Species 


Rates of nucleotide substitution vary between different gene regions, tending to 
be higher in introns and flanking regions than in coding sequences (Li and Graur, 
1991; Figure 7.1). Single base-pair changes in noncoding regions do not usually 
affect gene expression unless they occur in a promoter or regulatory region or 
alternatively impair mRNA splicing efficiency. Such changes are not usually sub- 
ject to the effects of negative selection and may therefore accumulate in the 
genome. Using anonymous DNA segments, Cooper et al. (1985) estimated the 
unique DNA sequence heterozygosity in the human genome to be 0.0037, indi- 
cating that ~1/250 bases vary polymorphically in human populations and that 
most of these variants may be expected to occur in noncoding regions. More 
recently, and employing rather larger sample sizes, other workers have estimated 
that polymorphisms in the human genome occur with a frequency of between 
1/200 and 1/1000 bp (Collins et al., 1997; Li and Sadler, 1991; Nickerson et al., 
1998; Wang et al., 1998; see Chapter 1, section 1.2.2). 

In coding regions, single base-pair substitutions that are silent are termed syn- 
onymous in that they do not change the amino acid sequence of the gene product. 
Most of these changes occur at the third base positions of codons. Such mutations 
are likely to be neutral with respect to fitness, assuming that they do not alter gene 
expression or mRNA splicing, stability or transport. Nonsynonymous mutations, 
on the other hand, alter the amino acid sequence of a polypeptide and can be 
either conservative or nonconservative. Conservative substitutions (e.g. Asp→Glu, 
Leu→Val) result in the replacement of one amino acid by another that is chemically 
similar to it and may thus alter protein structure and function only minimally 
(Creighton, 1993). Nonconservative substitutions (e.g. Arg→Gly, 
Arg→Pro), however, result in the replacement of one amino acid by another that 
possesses a chemically dissimilar side chain thereby altering either its charge or 
its polarity. Nonconservative substitutions are likely to have a more deleterious 
effect than conservative substitutions and will therefore tend to be removed from 
the population with a higher probability (Creighton, 1993). The rates of nonsynonymous 
and synonymous substitutions per site per year have been obtained for a variety 
of eukaryotic genes and are of the order of 0-2 × 10⁻⁹ and 2-12 × 10⁻⁹, respectively 
(MacIntyre, 1994; Li, 1997). The study of 1880 human-rodent orthologue pairs 
yielded average estimates of 0.52 × 10⁻⁹ and 2.92 × 10⁻⁹ for nonsynonymous and 
synonymous substitutions, respectively (Makalowski and Boguski, 1998). In general, 
eukaryotic pseudogenes evolve at approximately the same rate as the synonymous 
sites in functional genes (Li et al., 1981), a rate that can therefore be 
considered to represent the neutral rate of substitution. 

Owing to the nonrandom nature of the genetic code, different amino acids are 
encoded with different degrees of degeneracy. Base positions in codons can thus 
belong to any one of three classes: 
(i) Nondegenerate sites (~65% of base positions in human codons): all possible 
substitutions are nonsynonymous. The base substitution rate at these sites is 
very low (Figure 7.1) on account of the strong selection pressure to avoid 
amino acid exchanges. 

(ii) Two-fold degenerate sites (~19% of base positions in human codons): base posi- 
tions at which only one of the three possible substitutions is synonymous. 

(iii) Four-fold degenerate sites (~16% of base positions in human codons): base positions 
at which all three possible substitutions are synonymous. The base substitution 
rate at these sites is higher than those noted for nondegenerate sites 
and two-fold degenerate sites (Figure 7.1). The comparison of human and 
rodent genes has allowed the rates of transitional and transversional silent 
substitutions in four-fold degenerate sites to be estimated (1.71 × 10⁻⁹ and 
1.22 × 10⁻⁹ per site per year, respectively) since the human-rodent divergence 
(Collins and Jukes, 1994). 
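
The three classes of site listed above can be made concrete with a short sketch that assigns a degeneracy value to any base position of a codon under the standard genetic code. The function name `degeneracy` and the compact table layout are illustrative choices, not part of the original text. Note that the third position of the isoleucine codons (ATT, ATC, ATA) is strictly three-fold degenerate; such sites are commonly pooled with the two-fold class, which is assumed to be why only three classes are listed.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, codons enumerated in TCAG order (TTT, TTC, TTA, ...).
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def degeneracy(codon, position):
    """Count the bases (including the original) that, placed at `position`
    (0, 1 or 2), leave the encoded amino acid unchanged.

    1 = nondegenerate, 2 = two-fold, 3 = three-fold, 4 = four-fold.
    """
    original = CODON_TABLE[codon]
    return sum(
        1
        for base in BASES
        if CODON_TABLE[codon[:position] + base + codon[position + 1:]] == original
    )


print(degeneracy("GGT", 2))  # third position of glycine codon: 4
print(degeneracy("GAT", 2))  # third position of aspartate codon: 2
print(degeneracy("GGT", 0))  # first position of glycine codon: 1
```

Tallying these values over all base positions of a set of coding sequences would give class proportions comparable in kind to the percentages quoted above, although the exact figures depend on codon usage.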


7.1.3 Positive and negative selection in protein evolution 


It may be said that natural selection is daily and hourly scrutinizing, through- 
out the world, the slightest variations; rejecting those that are bad, preserving 
and adding up all that are good; silently and invisibly working, whenever and 
wherever opportunity offers, at the improvement of each organic being in 
relation to its organic and inorganic conditions of life. 

Charles Darwin (1859) The Origin of Species 


Protein coding genes exhibit considerable variation in their rates of synonymous 
(k_s) and nonsynonymous (k_a) substitutions (reviewed by Li, 1997). The k_s/k_a ratio 
serves as a rough guide to the degree of evolutionary conservation and therefore 
the functional constraints placed upon a protein by natural selection (Li et al., 
1985). Thus, some human gene sequences are very highly conserved (k_s/k_a ratio 
high), for example ubiquitins (Vrana and Wheeler, 1996), histones H3 and H4 
(Thatcher and Gorovsky, 1994), ribosomal proteins (De Falco et al., 1993), H19 
(Hurst and Smith, 1999), calmodulin (Thomas and Wilson, 1991), and the G protein 
α-subunits (Yokoyama and Starmer, 1992) indicating that these proteins have 
evolved under negative or purifying selection. By contrast, other human/primate 
genes have evolved very rapidly (k_s/k_a less than unity), for example the sex determining 
locus, SRY (Yp11.3; Whitfield et al., 1993), and those genes encoding 
apolipoprotein C-I (APOC1; 19q13.2; Pastorcic et al., 1992), protamines P1 and 
P2 (PRM1, PRM2; 16p13.3; Retief and Dixon 1993), myelin proteolipid protein 
(PLP; Xq22; Kurihara et al., 1997), pregnancy-specific glycoprotein 1 and carcinoembryonic 
antigen (PSG1, CEA; 19q13.2; Streydio et al., 1990; Teglund et al., 
1994), eosinophil cationic protein (RNASE3; 14q24-q31; Zhang et al., 1998), the 
rhesus blood group genes (RHD, 1p34-p36.2; RHAG, 6p11-p21.1; Kitano et al., 
1998) and β-microseminoprotein (MSMB, 10q11.2; Nolet et al., 1991). Such 
examples of a significantly higher rate of nonsynonymous nucleotide substitution 
than synonymous substitution provide strong evidence for the action of positive 
selection. 
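
As a rough illustration of how the ratio of synonymous (k_s) to nonsynonymous (k_a) substitution rates is read, the sketch below computes it and attaches the qualitative interpretation used in the paragraph above. The function names are hypothetical, and the simple comparison with unity is a simplification: in practice a departure from unity would need to be shown to be statistically significant before selection is inferred.

```python
import math


def ks_ka_ratio(k_s, k_a):
    """Ratio of synonymous to nonsynonymous substitution rates.

    Both rates must be expressed in the same units.
    """
    if k_s < 0 or k_a < 0:
        raise ValueError("substitution rates must be non-negative")
    if k_a == 0:
        return math.inf  # no nonsynonymous change observed
    return k_s / k_a


def interpret(ratio):
    """Qualitative reading of the ratio (no significance test applied)."""
    if ratio > 1:
        return "consistent with purifying (negative) selection"
    if ratio < 1:
        return "candidate for positive selection"
    return "consistent with neutral evolution"


# Average human-rodent estimates quoted in Section 7.1.2
# (Makalowski and Boguski, 1998): k_s = 2.92e-9, k_a = 0.52e-9.
ratio = ks_ka_ratio(2.92e-9, 0.52e-9)
print(round(ratio, 2), "-", interpret(ratio))
```

With the average human-rodent values the ratio is well above unity, in keeping with the view that most protein coding genes are subject to purifying selection.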

Lysozyme, originally a bacteriolytic enzyme whose origin preceded the emer- 
gence of the vertebrates, has been independently recruited as a digestive enzyme 
by two groups of mammals, the ruminant artiodactyls and the leaf-eating colobine 
monkeys (Prager, 1996; Stewart et al., 1987). Both groups are able to ferment plant 
material in the foregut and possess stomachs that contain high levels of lysozyme. 
Advanced ruminants such as cows, sheep and deer possess ~10 lysozyme genes as 
a result of gene amplification events which occurred after the divergence from the 
pig lineage. Sequence comparison of the lysozyme genes from human (LYZ; chromosome 
12) and other primates has indicated a k_s/k_a ratio significantly less than 
unity in both the colobine and hominoid lineages (Messier and Stewart, 1997), 
indicative of the action of positive selection for amino acid replacements. 
Interestingly, several of the amino acid replacements noted in the colobine lin- 
eage also occurred in parallel or convergently in the ruminant artiodactyls 
(Messier and Stewart, 1997; Stewart et al., 1987; Swanson et al., 1991). Comparison 
of the nucleotide sequences of the coding regions of the lysozyme genes of 
advanced ruminants and pigs revealed no difference in the rate of synonymous 
substitution consistent with the view that it was a change in selective pressure 
rather than the mutation rate that was responsible for changes in the rate of stom- 
ach lysozyme evolution (Yu and Irwin, 1996). The reasons for positive selection 
for lysozyme amino acid replacements in hominoids are at present unclear but 
Messier and Stewart (1997) suggested that it might have been associated with the 
increased neutrophil expression of lysozyme in hominoids as compared with 
other catarrhines. 

Since synonymous substitutions are likely to be neutral with respect to selec- 
tion, they have been employed in numerous attempts to calibrate molecular clocks 
(Easteal et al., 1995; Fitch and Ayala, 1994; Li, 1997). However, substitution rates 
can vary quite widely between orthologous gene sequences in different taxonomic 
groups or lineages, as well as between different genes in the same species (Britten, 
1986; Easteal, 1988; Easteal and Collet, 1994; Gibbs and Dugaiczyk, 1994; Li et 
al., 1990; Li et al., 1996; Ohta and Ina, 1995). The speed of the molecular clock 
appears to vary according to the lineage. In order to estimate the relative rates of 
nucleotide substitutions in two lineages leading to extant species A and B, a rela- 
tive rate test is employed. This involves the use of a third (reference) species, C, 
which is known to have branched off prior to the divergence of A and B. Pairwise 
comparisons of orthologues in A and C, and in B and C, are used to calculate the 
k value, the number of substitutions per 100 sites. The k_AC and k_BC values then 
provide a measure of the relative rates of mutation in the lineages leading to 
species A and B, respectively. Such calculations have suggested that the substitution 
rates in the lineages leading to mouse and rat are approximately equal [~7.9 
× 10⁻⁹ (3.9-11.8); Li and Graur, 1991; O’hUigin et al., 1992; Wolfe and Sharpe, 
1993] whereas comparable estimates for humans and Old World monkeys [~2.2 × 
10⁻⁹ (1.8-2.8)] and humans and chimpanzees [~1.3 × 10⁻⁹ (0.9-1.9)] are considerably 
lower. This is held to be indicative of a slowdown in the substitution rate in 
primates which appears to be at its greatest in hominoids (Bailey et al., 1991; 
Ellsworth et al., 1993; Gu and Li, 1992; Koop et al., 1986; Li and Graur, 1991; Li 
and Tanimura, 1987; Li et al., 1996; Seino et al., 1992). Despite these data, the 
hominoid slowdown is not a universal phenomenon in primate evolution since it is 
not apparent with some gene sequences (Easteal, 1991; Kawamura et al., 1991; 
Shaw et al., 1989). 
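
The relative rate test described above can be sketched in a few lines. The example assumes pre-aligned sequences of equal length and uses the Jukes-Cantor one-parameter correction for multiple hits; this choice of correction, the function names and the toy sequences are all assumptions made for illustration, since the studies cited used a variety of distance measures.

```python
import math


def p_distance(seq1, seq2):
    """Proportion of differing sites between two aligned sequences,
    ignoring positions where either sequence has a gap ('-')."""
    if len(seq1) != len(seq2):
        raise ValueError("sequences must be aligned to the same length")
    pairs = [(a, b) for a, b in zip(seq1, seq2) if a != "-" and b != "-"]
    if not pairs:
        raise ValueError("no comparable sites")
    return sum(1 for a, b in pairs if a != b) / len(pairs)


def k_per_100_sites(seq1, seq2):
    """Jukes-Cantor corrected number of substitutions per 100 sites."""
    p = p_distance(seq1, seq2)
    if p >= 0.75:
        raise ValueError("sequences too divergent for Jukes-Cantor correction")
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0) * 100.0


def relative_rate(seq_a, seq_b, seq_c):
    """Return (k_AC, k_BC, k_AC - k_BC), with C as the reference species.
    A positive difference suggests a faster rate in the lineage leading to A."""
    k_ac = k_per_100_sites(seq_a, seq_c)
    k_bc = k_per_100_sites(seq_b, seq_c)
    return k_ac, k_bc, k_ac - k_bc


# Toy data (hypothetical): A differs from C at 4 of 20 sites, B at 2 of 20.
species_c = "ACGTACGTACGTACGTACGT"
species_a = "GCGTATGTACATACGCACGT"
species_b = "GCGTACGTACATACGTACGT"
k_ac, k_bc, difference = relative_rate(species_a, species_b, species_c)
print(round(k_ac, 1), round(k_bc, 1), round(difference, 1))
```

Because the path from the common ancestor of A and B to the reference species C is shared by both comparisons, any difference between k_AC and k_BC reflects change accumulated since A and B diverged. A real analysis would also require a standard error for the difference before concluding that the two rates differ.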
Some genes/proteins show a rather different, discontinuous pattern of evolution. 
Thus, for example, in both primates and artiodactyls, the growth hormone (GH1; 
17q23) gene exhibits a pattern of near stasis punctuated by bursts of rapid evolution 
during which the rate of evolutionary change has increased at least 25-fold (Ohta, 
1993; Wallis, 1994, 1996). Interestingly, during mammalian evolution, only the cod- 
ing region of the growth hormone gene corresponding to the mature protein has 
exhibited rapid change whereas other regions of the gene (signal peptide, nonsyn- 
onymous substitutions and 5’ and 3’ untranslated regions) have remained relatively 
unchanged. This would be consistent with the rapid bursts of evolution being of 
adaptive significance and Wallis (1997) suggested that this may have involved what 
he termed function switching. Briefly, if primate GH were to have had biological func- 
tions other than growth promotion viz. lactogenic (prolactin-like) activity, GH 
could then also have played a role in maintaining the nutritional balance between 
mother and young. Acquisition of its lactogenic function could have involved 
changes in GH structure away from that best adapted to growth regulation to one 
that represented an evolutionary compromise which optimized the protein for its 
dual function. If the secondary function of GH had then changed (e.g. the adoption 
of a pattern of seasonal breeding following migration or climatic change, in which 
gestation and suckling no longer overlapped), selection for dual function would no 
longer have operated and the selection pressure on GH structure would have been 
related to its growth promoting function alone. If several amino acid substitutions 
had already occurred, then the process of functional reversal involving simply the 
reversal of each amino acid change would have been very unlikely. Rather, GH 
would have adopted a new structural form, slightly different from the original, but 
nevertheless adapted to its primary function. Switching back to dual function 
might then have led to a new cycle of structural change, driven by selection but 
without any overall change in function. This ‘pushme-pullyou’ mechanism of func- 
tion switching can in principle be resolved by gene duplication with subsequent 
divergence of the paralogues to perform different functions. Indeed, the gene dupli- 
cations giving rise to the GH cluster occurred after the rapid burst of evolution in 
primates. Once duplication had occurred, the rapid evolution of the pituitary 
expressed GH1 gene would probably have ceased although the relatively high evo- 
lutionary rate would still have been maintained by the placentally expressed genes. 
Whilst this idea is very appealing, it should be pointed out that the evolution of the 
placental lactogens early on in the mammalian radiation might reasonably be 
expected to have removed any selective pressure on the lactogenic properties of GH. 

Another example of the intermittent acceleration and deceleration of the 
nucleotide substitution rate is provided by the cytochrome c oxidase subunit IV 
(COX4; 16q22-qter) gene at different stages of primate evolution (Wu et al., 1997). 
In passing, it is interesting to note that conceptually, the idea of function switch- 
ing had its origin in the days of the early evolutionists: 


It is an error to imagine that evolution signifies a constant tendency to 
increased perfection. That process undoubtedly involves a constant remodel- 
ing of the organism in adaptation to new conditions; but it depends on the 
nature of those conditions whether the direction of the modifications effected 
shall be upward or downward. 

T.H. Huxley (1888) The Struggle for Existence in Human Society 


Intriguingly, Miyata et al. (1994) have claimed that the rate of evolutionary change 
of tissue-specific genes varies with the site of expression. Thus, brain-specific 
genes were reported to evolve at a slower rate than immune system-specific genes. 
However, since the identity of the compared genes was not revealed and the sample 
size was small, it is hard to comment on the validity or otherwise of the authors’ 
conclusions. 


7.1.4 Mutation rates and their evolution 


The fundamental importance of mutation for any account of evolution is clear. 
It enables us to escape from the impasse of the pure line. Selection within a 
pure line will only be ineffective until a mutation arises. 

J.B.S. Haldane The Causes of Evolution (1932) 


Although very much dependent upon the type of mutation being considered and 
the identity of the gene in question, studies of disease pathology have suggested 
that the mutation rate in males is higher than that in females (reviewed by Cooper 
and Krawczak, 1993). This has been held to reflect the rather higher number of 
cell divisions required from zygote to mature germ cell in the male as compared 
to the female. In an evolutionary context, Miyata et al. (1987) developed a method 
for estimating the male-to-female ratio of mutation rates (α_m) from rates of 
nucleotide substitution in sex-linked and autosomal sequences. They predicted 
that if α_m were very large, the rate of synonymous substitution in X-linked genes 
would be ~2/3 of that in autosomal genes on the basis that the X chromosome is 
twice as likely to be present in a female as in a male. This prediction was borne out 
by their analysis of human and rodent gene sequences. Further, the rate at which 
the Y chromosome accumulated substitutions was found to be twice that of the 
autosomes (the Y chromosome mutates at relative rate α_m as mutation always 
occurs in a male). The comparison of rodent Ube1 genes and pseudogenes allowed 
α_m to be estimated to be of the order of 2.0 (Chang and Li, 1995). Similarly, com- 
parison of SMCX/SMCY (Xp11.21-p11.22/Yq) genes from mouse, horse and 
human yielded an estimate for α_m of 3.0 (Agulnik et al., 1997). By contrast, com- 
parison of intronic sequences of the homologous ZFX (Xp21.3-p22.3) and ZFY 
(Yp11.32) loci in humans, orangutans, baboons and squirrel monkeys suggested 
that α_m may be as high as 6.0 in primates (Shimmin et al., 1993). The actual value 
of α_m may however be very much dependent upon the species compared. Thus, f- 
globin pseudogene sequence data were used to derive estimates of α_m between 3 
and 6 in higher primates but only ~2 in mice and rats (Li et al., 1996). If studies 
of many different systems indicate that the mutation rate is consistently higher in 
males than in females, then it may indeed be possible to view evolution as being 
‘male-driven’ (Hurst and Ellegren, 1998). 
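
The reasoning behind these ratios can be made explicit with a short sketch. It assumes, as in the standard formulation of the male-driven evolution argument, that an X chromosome spends two-thirds of its time in females and one-third in males, an autosome half in each sex, and a Y chromosome all of its time in males. The function names are illustrative only, and the male-to-female mutation rate ratio (written α_m here) is passed as `alpha`.

```python
def x_to_autosome_ratio(alpha):
    """Expected X-linked/autosomal substitution rate ratio.

    With the female mutation rate set to 1 and the male rate to alpha:
    X rate = (2 + alpha) / 3, autosomal rate = (1 + alpha) / 2.
    """
    return (2.0 / 3.0) * (2.0 + alpha) / (1.0 + alpha)


def y_to_autosome_ratio(alpha):
    """Expected Y-linked/autosomal ratio: Y rate = alpha, autosomal = (1 + alpha) / 2."""
    return 2.0 * alpha / (1.0 + alpha)


def alpha_from_y_to_x(ratio):
    """Invert Y/X = 3 * alpha / (2 + alpha) to recover alpha from an observed Y/X ratio."""
    if not 0.0 <= ratio < 3.0:
        raise ValueError("the Y/X ratio must lie in [0, 3) under this model")
    return 2.0 * ratio / (3.0 - ratio)


if __name__ == "__main__":
    for alpha in (1.0, 2.0, 3.0, 6.0, 1e6):
        print(alpha, x_to_autosome_ratio(alpha), y_to_autosome_ratio(alpha))
```

As `alpha` grows, the X/autosome ratio tends to 2/3 and the Y/autosome ratio tends to 2, which are the two limiting values quoted above. When `alpha` equals 1, both ratios equal 1.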

That some sequences are hypermutable in the human genome is clear from 
studies of pathological lesions responsible for human genetic disease (Cooper and 
Krawczak, 1993; Krawczak et al., 1998). Hotspots for somatic mutation are also 
apparent from studies of mammalian immunoglobulin genes (consensus 
sequences, G C/T A/T and TAA; Rogozin and Kolchanov, 1992). Such hotspots 
can be interpreted in terms of nearest neighbor effects on nucleotide substitution 
rates (Krawczak et al., 1998; Section 7.5.1). In an evolutionary context, these 
effects are also apparent from sequence comparison studies of human genes and 
pseudogenes (Blake and Hess, 1992; Hess et al., 1994). 

The mutation rate in higher organisms has long been assumed to be a compromise 
between keeping the frequency of deleterious substitutions low and not 
completely abolishing the potential for generating adaptive variation. 
One prediction of this trade-off theory is that a lower equilibrium mutation rate 
should evolve if the deleterious effect of mutation is increased. To examine the 
validity of this postulate, McVean and Hurst (1997) compared rates of synony- 
mous nucleotide substitution in 33 X-linked genes and 238 autosomal genes in 
mouse and rat. Since the X chromosome is hemizygous in male mammals, delete- 
rious recessive mutations arising on it might reasonably be expected to have a 
greater effect on fitness than those arising on the autosomes. If, however, synony- 
mous substitutions were completely neutral with respect to fitness, then they 
should accumulate at a rate equal to the mutation rate and could be used directly 
to estimate mutation rates. McVean and Hurst (1997) found that the X-linked 
genes exhibited significantly lower rates of synonymous substitution than their 
autosomal counterparts and this was held to be explicable in terms of the X-chro- 
mosomal mutation rate being reduced by natural selection. 

Mutational pressure may have influenced the evolution of the eukaryotic 
genetic code in that the code’s organization appears to provide at least 
some protection against the deleterious consequences of single base-pair substitu- 
tions (Goldman, 1993; Haig and Hurst, 1991; Jukes, 1993; Kuhn and Waser, 1994; 
Osawa and Jukes, 1988; see Section 7.2). However, the converse argument is more 
persuasive: that the design of the genetic code and the functional similarity (or 
dissimilarity) of amino acids to one another may have affected the relative muta- 
bilities of individual codons during evolution. Codon usage varies quite widely 
between different vertebrates (Nakamura et al., 1998; CUTG Database at 
http://www.dna.affre.go.jp/~nakamura/CUTG.html), a finding which may be 
related to the influence of the codon frequencies of highly expressed genes on 
translation efficiency via tRNA pools (Britten, 1993). Codon usage may also vary 
between different genes in the same species. In humans, such a finding has been 
suggested to be related to chromosomal location (D’Onofrio et al., 1991; see 
Section 7.4). 


7.1.5 The deleterious mutation rate in humans 


Man’s yesterday may ne’er be like his tomorrow; 
Nought may endure but mutability. 
Percy Bysshe Shelley (1816) Mutability 


It has been suggested that humans may experience a high deleterious mutation 
rate (Crow, 1997; Kondrashov and Crow, 1993). If a significant proportion of these 
mutations were even mildly deleterious, such lesions would tend to accumulate in 
populations with small effective sizes or in which selection had been relaxed. 
Eyre-Walker and Keightley (1999) estimated the human deleterious mutation rate 
per diploid genome per generation, U, by comparing the expected and observed 
rates of nonsynonymous substitution in 46 orthologous gene pairs from human 
and chimpanzee. Under conservative assumptions of 60 000 genes in the human 
genome, an average length of protein coding sequence of 1.52 kb, a human-chim- 
panzee divergence time of 6 Myrs ago, and an average generation time of 25 years 
in the human lineage, Eyre-Walker and Keightley (1999) estimated the rate of 
nonsynonymous substitutions, M, to be 4.2 ± 0.5 mutations per diploid genome 
per generation, and the deleterious mutation rate, U, to be 1.6 ± 0.8 mutations per 
diploid genome per generation. Estimates of U for chimpanzee (1.7 ± 0.8) and 
gorilla (1.2 ± 0.6) were found to be similar to that obtained for humans. Eyre- 
Walker and Keightley (1999) concluded that the human deleterious mutation rate 
is close to the upper limit tolerable by a species with a low reproductive rate. This 
implies that in hominids, synergistic epistasis may have occurred between delete- 
rious mutations. Further, the level of selective constraint (U/M) in human protein 
coding sequences (0.38 ± 0.17) was judged to be atypically low as were estimates 
from chimpanzee (0.53 ± 0.16) and gorilla (0.38 ± 0.17). A large number of 
slightly deleterious mutations may therefore have become fixed in the hominids 
and the most likely explanation for this is small long-term effective population 
size. 
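
The relationship between the quoted figures can be checked with simple arithmetic. The sketch below uses only the point estimates given above and treats the level of selective constraint as the ratio U/M. The values of M back-calculated for chimpanzee and gorilla are not reported in the text and are derived here purely for illustration; the function names are likewise hypothetical.

```python
# Point estimates quoted in the text (Eyre-Walker and Keightley, 1999).
HUMAN_U = 1.6   # deleterious mutations per diploid genome per generation
HUMAN_M = 4.2   # M as quoted in the text, same units

# U and constraint (U/M) as quoted for the other two species.
OTHER_SPECIES = {
    "chimpanzee": {"U": 1.7, "constraint": 0.53},
    "gorilla": {"U": 1.2, "constraint": 0.38},
}


def selective_constraint(u, m):
    """Level of selective constraint, taken here as the ratio U/M."""
    return u / m


def implied_m(u, constraint):
    """Back-calculate M from U and the constraint (an assumption, not a reported value)."""
    return u / constraint


if __name__ == "__main__":
    print("human U/M:", round(selective_constraint(HUMAN_U, HUMAN_M), 2))
    for species, values in OTHER_SPECIES.items():
        print(species, "implied M:", round(implied_m(values["U"], values["constraint"]), 1))
```

The human ratio reproduces the quoted constraint of 0.38, which confirms that the three human figures are mutually consistent.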


7.2 Mutations in pathology and evolution; two sides of the 
same coin 


Genotypes never have votes. Phenotypes sometimes do. 
A.L. Mackay 


In the early days of genetics, many thinkers saw spontaneous variation purely and 
simply in terms of its role as the evolutionary fuel for speciation. The first to draw 
parallels with disease was probably the British geneticist William Bateson who, at 
the turn of the century, maintained that disease presented ‘a discontinuity closely 
comparable with that of many variations’. Indeed, he speculated 


that the problem of species [may well] be solved by the study of pathology; for 
the likeness between variation and disease goes far to support the view which 
Virchow has forcibly expressed, that ‘every deviation from the type of the par- 
ent animal must have its foundation on a pathological accident’ 

Bateson (1894) 


Couched in modern terms, single base-pair substitutions in human gene pathol- 
ogy and evolution may be viewed as two sides of the same coin. This appealing 
supposition has recently been corroborated by an in-depth comparison of muta- 
tions causing inherited disease with mutations in noncoding DNA that have 
become fixed during the evolutionary divergence of human and other mam- 
malian genomes (Krawczak and Cooper, 1996a). Mutations in noncoding DNA 
may date back some millions of years and their survival has been independent of 
natural selection. By contrast, disease-associated substitutions are of fairly recent 
origin by comparison with the evolutionary timescale. This difference notwith- 
standing, under the assumption that the underlying molecular mechanisms of 
mutation have not changed substantially during mammalian evolution, some 
resemblance between the two mutational spectra was to be expected. Consistent 
with this explanation, the relative base substitution rates observed in the context 
of human genetic disease (Cooper and Krawczak, 1993; Krawczak and Cooper, 
1996a) were found to be remarkably similar to those derived by Hess et al. (1994) 
in an extensive analysis of human gene/pseudogene alignments. This pattern of 
similarity was still apparent after allowance had been made for the DNA sequence 
environment at the site of mutation by the use of nearest neighbor-dependent 
mutation rates (Krawczak and Cooper, 1996a); the only notable difference was a 
slight under-representation among pseudogene mutations of C→T and G→A 
substitutions. This was held to be explicable in terms of either a lower level of 
germline methylation or, perhaps more likely, a deficiency of the 5mC-containing 
CG dinucleotides (‘CG suppression’; Bird, 1980) in non-coding DNA sequences. 

The claim that relative mutation rates exhibit a strong positive correlation 
between pathological mutations in coding sequences and evolutionarily fixed 
mutations in non-coding sequences (Krawczak and Cooper, 1996a) might appear 
surprising in the light of other experimental results. For example, a study of 
murine aprt gene sequences has claimed that silent mutations accumulate more 
slowly in transcribed sequences, possibly due to preferential DNA repair (Boulikas, 
1992; Turker et al., 1993). Further, Hanawalt (1990) has shown that in vitro-induced 
pyrimidine dimers and interstrand DNA crosslinks are repaired with a substan- 
tially higher efficiency in active genes than in noncoding regions. Although this 
type of lesion is specific to the action of particular exogenous chemical mutagens 
and irradiation, the idea of a system which is generally more effective at removing 
endogenous mutations from coding DNA as opposed to noncoding DNA is appeal- 
ing. This is because efficient DNA repair should only have conferred a substantial 
selective advantage in coding regions. By contrast, the results of Krawczak and 
Cooper (1996a) suggest that the relative contribution (via variable efficiency) of dif- 
ferent DNA repair pathways to the generation of mutations is unlikely to differ 
substantially between intragenic and intergenic sequences. 

To what extent is the likelihood of generation of a mutation related to its phe- 
notypic consequences? When codon substitutions causing genetic disease were 
categorized according to whether they were neutral or whether they changed the 
hydrophobicity or polarity of the encoded amino acid residue, it emerged that 
neutral changes were characterized by larger likelihoods of generation via muta- 
tion than nonneutral substitutions (Krawczak and Cooper, 1996a). This disparity 
suggested that selection has operated on the cellular DNA repair machinery in 
such a way as to optimize the removal of the latter type of mutation. If nonneutral 
changes were more likely to result in a disadvantageous (disease) phenotype than 
neutral substitutions, then any repair bias operating against these changes at the 
DNA level would have had a selective advantage. 

The hypothesis of a mutational repair bias was further supported by the finding 
that the likelihood of generation of an amino acid substitution in humans is neg- 
atively correlated with its likelihood of coming to clinical attention (Krawczak 
and Cooper, 1996a). The extent of the correlation was, however, found to differ 
dramatically between different types of substitution. Thus, a significant decrease 
in mutation generation likelihood was only associated with an increased likeli- 
hood of clinical observation for substitutions which affected hydrophilic and 
polar residues. Various explanations may be considered to account for these 
findings. First, substitutions of hydrophilic and polar amino acid residues could 
have resulted in more severe and/or variable consequences for the phenotype than 
other types of substitution during most of the time period that the DNA repair 
mechanisms were evolving. Second, it has been observed that, during evolution, 
the effect of an amino acid change in the hydrophobic protein core is often com- 
pensated for by another change in the immediate vicinity (Schirmer, 1979). Thus, 
any selective pressure acting upon the organism to avoid substitutions at 
hydrophobic core residues may have been balanced by the requirement for evolu- 
tionary fixation of a second compensatory mutation. Finally, the majority of adap- 
tive events shaping the eukaryotic DNA repair process will have occurred in 
organisms other than human. It therefore follows that, were current clinical 
observation likelihoods to be specific to human, this could have obscured any cor- 
relation between mutation generation likelihoods and the phenotypic conse- 
quences of substitutions at hydrophobic or nonpolar residues. 

If mutations in human genes are biased against amino acid replacements with a 
high present day probability of resulting in a disease phenotype, the question arises 
as to whether such a bias might also be reflected in the evolutionary history of pro- 
tein sequences. To explore this possibility, Krawczak and Cooper (1996a) calculated 
a quantity termed the relative evolutionary acceptability for each possible amino acid 
substitution from the data of Collins and Jukes (1994) who had reported the num- 
bers and types of amino acid mismatches deduced from the alignment of 337 pairs 
of human-rodent cDNA sequences. Krawczak and Cooper (1996a) found that clini- 
cal observation likelihoods were negatively correlated with evolutionary acceptabil- 
ity values. This implied that the more likely a given amino acid substitution was to 
result in a disease phenotype (at least in contemporary humans), the less often has 
it been tolerated during the evolution of mammalian protein sequences. 

In summary, it is evident that the evolutionary requirement to avoid a deleteri- 
ous phenotype has left its footprints in the mechanisms of mutation generation. 
In this context, the most promising target for selection would appear to be the 
intracellular DNA repair mechanism. Although the effect of a given amino acid 
replacement upon protein structure is known to be heavily dependent upon its 
precise location within the tertiary structure of the molecule (Alber, 1989; Pakula 
and Sauer, 1989; Wacey et al., 1994), some basic rules which relate local causes and 
consequences may nevertheless be perceived (De Filippis et al., 1994). If the effi- 
ciency of mutation removal were directed by the immediate DNA sequence con- 
text of a lesion, it may be that this has facilitated the avoidance during evolution 
of hazardous amino acid replacements by consideration of the genetic code. 


7.3 The importance of evolutionary conservation in the 
study of pathological mutations at the protein level 
using human factor IX as a model 


The nature, frequency and location of gene lesions causing human genetic disease 
are highly specific and, as outlined in Chapter 1, section 1.5, are determined in 
part by the local DNA sequence environment. Once a given mutation has arisen, 
however, the likelihood that it will come to clinical attention is a complex 
function of the nature of the resulting amino acid substitution, its precise location 
and immediate environment within the protein molecule, and its effects upon 
protein structure and function. Although in-depth investigations of the in vivo 
effects of missense mutations upon specific human proteins are generally rare, 
two such studies have nevertheless been performed for human factor IX (Bottema 
et al., 1991; Wacey et al., 1994), the liver-expressed zymogen of a vitamin K-depen- 
dent serine protease that activates factor X in the presence of factor VIIIa. These 
studies will be discussed in some detail. 

The vast majority of known lesions in the F9 (Xq26-q27) gene causing hemo- 
philia B are missense mutations (Giannelli et al., 1996), causing ~59% of severe 
(<1% FIX:C) and moderate (1-5% FIX:C) hemophilia, and perhaps as much as 
97% of mild (>5% FIX:C) hemophilia (Sommer et al., 1992). On the basis of 95 
independent missense mutations, Bottema et al. (1991) concluded that substitu- 
tions of ‘generic’ factor IX residues (conserved in factor IX of other mammals and 
in related human serine proteases) almost invariably cause hemophilia B. 
Mutations at factor IX-specific residues (conserved only in mammalian factor IX) 
and nonconserved residues were, by contrast, found to be some 6- to 33-fold less 
likely to result in a disease phenotype. Even though the study of Bottema et al. 
(1991) provided new insights into the identity of amino acid residues of structural 
or functional importance to factor IX, the authors did not employ models of the 
tertiary structure of the protein or its constituent domains. The significance of 
the location of specific amino acid residues within the structure of the factor IX 
molecule to the consequences of mutation could therefore not be assessed. 
Moreover, neither the variable propensity of different regions of the F9 gene to 
mutate nor the nature of the resulting amino acid exchanges were considered. 

In many ways, factor IX represents an ideal system in which to assess the influ- 
ence of positional determinants upon the disease-associated mutational spectrum 
of a single protein. Firstly, the number of known F9 missense mutations is among 
the highest of all human genes (Human Gene Mutation Database; 
http://www.uwem.ac.uk/uwcm/mg/hgmd0.html). Secondly, the amino acid 
sequences of numerous other vertebrate factor IX proteins and evolutionarily 
related serine proteases are available for direct comparison (Sarkar et al., 1990; 
Bottema et al., 1991). Finally, the structure of factor IX has been determined by X- 
ray crystallography (Brandstetter et al., 1995) and the three-dimensional struc- 
tures of a number of homologous serine proteases are also known. Wacey et al. 
(1994) constructed, by comparative methods (Swindells and Thornton, 1991), a 
multidomain model of the quaternary structure of activated factor IX (FIXa) and 
used this model to study the expression pathway of F9 gene lesions from genotype 
to clinical phenotype: a total of 277 different single base-pair substitutions in the 
F9 gene, comprising 241 missense mutations and 36 nonsense mutations, were 
analysed. Comparison of the relative nearest neighbor-dependent single base-pair 
substitution rates in the F9 gene with estimates derived from a wide range of 
other human genes revealed similar profiles (with CpG dinucleotides represent- 
ing hotspots for mutation), suggesting that similar mutational mechanisms were 
operating at the DNA level. 

Wacey et al. (1994) classified F9 missense mutations as either conservative or 
nonconservative on the basis of the chemical difference between the wild-type and 
the mutant amino acid residue. Chemical difference is a measure, originally devised 
by Grantham (1974), that combines the three interdependent properties of chem- 
ical composition, polarity and molecular volume of an amino acid residue. When 
the magnitude of biochemical change upon substitution was measured in this way 
and related to the clinical severity of the resulting disease phenotype, nonconser- 
vative substitutions (i.e. those characterized by large chemical differences) were 
found to result in severe rather than mild or moderate hemophilia B approxi- 
mately 1.7 times more often than conservative substitutions. Conversely, conserv- 
ative substitutions were 3- to 4-fold more likely to be associated with a moderate 
or mild phenotype than their nonconservative counterparts. The possibility was 
considered that conservative amino acid substitutions might be more likely to 
come to clinical attention in the tightly packed core of the protein as opposed to 
the surface of the molecule where they might be more readily tolerated. However, 
since mutations from all chemical difference classes appeared to be scattered over 
all domains of the protein and since no one chemical difference class was found to 
be associated solely with surface or buried residues, there appeared to be no rela- 
tionship between the magnitude of the amino acid exchange and the location of 
the affected residue (Wacey et al., 1994). 

The extent of evolutionary sequence conservation exhibited by amino acid 
residues in factor IX was also found to correlate with disease severity. Whilst 71% 
of mutations at ‘highly conserved’ residues (residues conserved in all mammalian 
factor IX proteins and in three other serine proteases) caused severe rather than 
mild or moderate hemophilia B, this was the case for only 50% of mutations at less 
conserved residues. Furthermore, Wacey et al. (1994) estimated that missense 
mutations at non-conserved residues were 15-20 times less likely than mutations 
at conserved residues to result in a disease phenotype at all. Although this implies 
that many missense mutations at evolutionarily unconserved residues are toler- 
ated by the molecule and do not come to clinical attention, the relative impor- 
tance of such residues was considered to be greater than previously claimed by 
Bottema et al. (1991). Several explanations were suggested for this discrepancy. 
Firstly, the sample of mutations used by Wacey et al. (1994) was three times larger 
than that of Bottema et al. (1991). Secondly, Wacey et al. (1994) allowed for deter- 
minants neglected by Bottema et al. (1991) viz. the actual F9 gene coding sequence 
and the redundancy of the genetic code. Finally, the two studies were not directly 
comparable since Wacey et al. (1994) confined their estimation of clinical obser- 
vation likelihoods to severe cases of hemophilia B in order to cope with the prob- 
lem of identical-by-descent mutations which are likely to be more prevalent in 
cases of mild or moderate disease. 

Amino acid residues which are sequence conserved both in mammalian factor 
IX proteins and four different human serine proteases are likely to be critical for 
functions common to all serine proteases. These residues are located predomi- 
nantly in the interior of factor IX, within α-helices or β-turns, and Wacey et al. 
(1994) found that substitutions tended to cluster in the Gla domain, the EGF 
domains and the serine protease domain (around the reactive site and oxyanion 
hole). All but one of the Cys residues known to be involved in disulfide bonding 
were affected by mutation as were the reactive site residues, residues involved in 
carboxylase recognition and activation peptide cleavage site, and residues which 
contribute to factor X binding. By contrast, mutations at factor IX residues which 
are sequence conserved in mammals and one or two other serine proteases were 
found to cluster in the serine protease domain and also at domain boundaries 
within the protein structure. Presumably, docking of the constituent domains of 
factor IX and the other homologous serine proteases during protein folding 
requires amino acid conservation at these boundaries. Finally, factor IX residues 
which are sequence conserved between mammals but not between other serine 
proteases may be exclusively important for the structure and function of factor 
IX. When the 3D structure of the factor IX protein was considered, spatially clus- 
tered groups of mutations at such residues became apparent on the surface of the 
EGF domains, in regions implicated in the binding of factors Va and VIIIa, and 
of the serine protease domain. 

Regions which exhibit similar functions in different homologous proteins may 
not only be sequence conserved but may also exhibit structural conservation 
(Greer, 1990). Although sequence conserved regions (SeqCRs) of mammalian serine 
proteases are invariably structurally conserved, structurally conserved regions 
(SCRs) may differ markedly with respect to their amino acid sequences. SCRs 
were defined by Greer (1990) as those portions of known protein structures that 
overlap very well when superimposed. In serine proteases, SCRs usually comprise 
secondary structure elements, the active site and other essential structural frame- 
work residues of the molecule. 

The locations of structurally conserved regions in human factor IXa were 
determined by Wacey et al. (1994) employing their homology model of the factor 
IX protein. No clear relationship was noted between the severity of the hemo- 
philia B phenotype and the level of structural conservation of a mutated factor IX 
amino acid residue, although substitutions at structurally conserved residues 
were estimated to have an approximately two-fold higher likelihood of resulting 
in a disease state than mutations at nonconserved residues. Interestingly, mutated 
sites which were not sequence-conserved were nearly all structurally conserved. 
The only exception involved two missense mutations at Gly59. However, Gly59 
lies immediately adjacent to a type B hairpin SCR and would be predicted to be 
critical in defining this structural element (Swindells and Thornton, 1991). Some 
SCRs, although not sequence-conserved, may thus serve as structural supports 
through their backbone interactions and should therefore be regarded as ‘scaf- 
folding’ residues rather than ‘spacers’ (Bottema et al., 1991). 

The topological properties of a mutated factor IX amino acid residue are also 
important for determining clinical severity. Mutations at residues with their side 
chains pointing away from the solvent (‘buried residues’) were found to cause 
severe hemophilia 1.5 times more often than mutations at residues with solvent- 
accessible side chains (Wacey et al., 1994). Finally, the likelihood of a mutation 
resulting in a severe disease phenotype was higher for substitutions in hydropho- 
bic as opposed to polar regions, probably because of the critical importance of 
these residues for correct protein folding (Kragelund et al., 1999). This is consis- 
tent with the conclusions of other workers that amino acid substitutions occur- 
ring in the protein core give rise to a ‘continuum of increasingly non-native 
properties’ affecting the stability and/or the folding dynamics of the protein 
(Alber, 1989; Lim et al., 1992; Pakula and Sauer, 1989). 


7.4 Equilibrium of synonymous codon substitutions 


Single base-pair substitutions in coding regions that do not change the encoded 
amino acid sequence can be assumed to be comparatively free of selectional con- 
straints (Creighton, 1993). Although it cannot be entirely excluded that these syn- 
onymous (silent) changes might influence gene expression via effects on mRNA 
translation (e.g. via alterations in RNA secondary structure or local imbalances in 
the tRNA reservoir of a cell), the probability of survival and ultimate population 
fixation should in general be much higher for silent mutations than for missense 
mutations (Nei, 1987). This view is consistent with the fact that the vast majority 
of evolutionarily stable base substitutions in coding regions of human genes have 
taken place at the wobble positions of degenerate codons (Wilbur, 1985). 

There are 19 groups of triplets which encode the same amino acid such that the
constituent triplets of each group can be replaced by each other via single
base-pair substitution (see Table 7.1; Krawczak and Cooper, 1996b). If mutations that
cause an amino acid exchange are ignored, the mutation dynamics within each 
group of codons can be modelled by a simple system of linear equations involving 
the relative rates of different single base-pair substitutions. With evolutionary 
time, this system will approach an equilibrium state and the equilibrium codon 
frequencies within each group can be determined by solving this system of equa- 
tions. As can be inferred from Table 7.1, the actual frequencies within degenerate 
codons are still some distance from equilibrium in humans. 
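
The construction described above can be illustrated with a short Python sketch. It is not the procedure of Krawczak and Cooper (1996b) but a minimal reconstruction under stated assumptions: the groups of synonymous codons are derived from the standard genetic code, and the codon frequencies of one group are then allowed to evolve under the linear mutation dynamics until they settle at equilibrium. The function names, the step size, the 'current' codon usage and above all the relative substitution rates are hypothetical placeholders, not empirical values.

```python
from itertools import product
from math import dist

BASES = "TCAG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ... ('*' = stop).
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def synonymous_groups(code=CODE):
    """Groups of >1 sense codons connected by synonymous single-base changes."""
    seen, groups = set(), []
    for start in code:
        if start in seen or code[start] == "*":
            continue
        stack, group = [start], set()
        while stack:
            codon = stack.pop()
            if codon in group:
                continue
            group.add(codon)
            for pos in range(3):
                for base in BASES:
                    other = codon[:pos] + base + codon[pos + 1:]
                    if other != codon and code[other] == code[codon]:
                        stack.append(other)
        seen |= group
        if len(group) > 1:
            groups.append((code[start], sorted(group)))
    return groups


def equilibrium(codons, rate, step=0.01, iterations=20000):
    """Iterate the linear mutation dynamics of one codon group to equilibrium.

    rate[(x, y)] is the (hypothetical) relative rate of the substitution x>y.
    """
    n = len(codons)
    flow = [[0.0] * n for _ in range(n)]
    for i, a in enumerate(codons):
        for j, b in enumerate(codons):
            changed = [(x, y) for x, y in zip(a, b) if x != y]
            if len(changed) == 1:  # codons differ by exactly one base
                flow[i][j] = rate[changed[0]]
    freq = [1.0 / n] * n
    for _ in range(iterations):
        gain = [sum(freq[j] * flow[j][i] for j in range(n)) for i in range(n)]
        loss = [freq[i] * sum(flow[i]) for i in range(n)]
        freq = [f + step * (g - l) for f, g, l in zip(freq, gain, loss)]
    return freq


# Placeholder rates: all substitutions equally likely except A>G, set twice as high.
rate = {(x, y): 1.0 for x in BASES for y in BASES if x != y}
rate[("A", "G")] = 2.0

groups = synonymous_groups()
glu_equilibrium = equilibrium(["GAA", "GAG"], rate)
# Euclidean distance of an invented 'current' usage (50:50) from that equilibrium.
glu_distance = dist([0.5, 0.5], glu_equilibrium)
```

Under these placeholder rates the GAA:GAG equilibrium is 1:2, and the grouping step returns the 19 codon groups referred to in the text. The step size must remain small relative to the largest rate for the iteration to be stable; solving the linear system directly would serve equally well.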

Table 7.1. Euclidean distance between the vectors of current and equilibrium frequencies
within degenerate codons of human genes (from Krawczak and Cooper, 1996b)

Encoded                                 Euclidean
amino acid   Codon group                distance
Glu          GAA GAG                    0.083
Lys          AAA AAG                    0.088
Asp          GAT GAC                    0.164
Asn          AAT AAC                    0.173
Tyr          TAT TAC                    0.196
His          CAT CAC                    0.222
Pro          CCT CCC CCA CCG            0.240
Thr          ACT ACC ACA ACG            0.275
Gln          CAA CAG                    0.294
Ser          TCT TCC TCA TCG            0.295
Cys          TGT TGC                    0.311
Gly          GGT GGC GGA GGG            0.314
Ala          GCT GCC GCA GCG            0.318
Phe          TTT TTC                    0.367
Arg          CGT CGC CGA CGG AGA AGG    0.371
Ser          AGT AGC                    0.416
Leu          TTA TTG CTT CTC CTA CTG    0.431
Val          GTT GTC GTA GTG            0.444
Ile          ATT ATC ATA                0.540

When a similar analysis is performed for other vertebrate species with known
codon usage (Wada et al., 1991), it turns out that humans are not the closest to
their own equilibrium. In 17/19 cases, Xenopus laevis ranks first whereas humans
and rodents form a second group of species, all ranking equally low (Krawczak
and Cooper, 1996b). This class is followed by chicken, dog, cow, pig, rabbit and
sheep in order of distance from equilibrium. There is thus some correlation
between distance from equilibrium and generation time and, with the exception
of humans, the ranking of species reflects the total synonymous substitution rates
estimated by Bulmer et al. (1991). These data are therefore consistent with a
model of DNA sequence evolution which, after species divergence, allows
ancestral gene sequences to approach equilibrium codon usage faster in one species
than in another.

If current codon usage in different species were indeed the result of the diver- 
gent evolution of common ancestral sequences progressing at different absolute 
(albeit equal relative) substitution rates, then one species should always be closer 
to equilibrium than the other in all codon groups. This, however, is not the case. 
For example, hamster is closer to equilibrium than dog for the tyrosine (TAT, 
TAC) and histidine (CAT, CAC) encoding triplets, but not for glutamine (CAA, 
CAG). Furthermore, the codon frequencies for lysine (AAA, AAG), aspartic acid 
(GAT, GAC) and glutamic acid (GAA, GAG) in Xenopus are on opposite sides of
the equilibrium when compared to all other vertebrates (Krawczak and Cooper, 
1996b). 

One may therefore surmise that codon usage has not evolved in a strictly uni- 
form way. Although it cannot be excluded that relative mutation rates differ 
between species, thereby resulting in different equilibria, the fact that Xenopus and
rodents are very close to an equilibrium which is itself based upon human genetic
disease data argues strongly against this objection. It is thus more likely that
species divergence has been accompanied by substantial changes in codon usage, 
allowing some species to manifest sudden changes with respect to their distance 
from equilibrium. This would also be consistent with the finding that differences
in synonymous codon usage between vertebrates are not explicable merely by
differential absolute DNA repair efficiency (Eyre-Walker, 1994).

Ikemura and Wada (1991) were able to demonstrate through an analysis of 
approximately 2000 human gene sequences that codon usage differs dramatically 
between different genomic regions. The major proportion of GC-rich genes was 
observed in T-bands whereas AT-rich genes were located mainly in G-bands. 
Further, the average G+C percentage at the third position of codons was found to 
be related to the quinacrine dullness and the mitotic chiasma density of a partic- 
ular chromosome. Since species divergence has almost always been characterized 
by gross structural chromosomal rearrangements, and since different genes are 
known to evolve at different absolute rates even in one and the same species 
(Bulmer et al., 1991), the piece-wise reconstitution of new genomes from their 
common ancestors is likely to have altered codon frequencies substantially. 


7.5 Single base-pair substitutions in gene regions 


The smallest changes that add to or subtract from a part in the smallest mea- 
surable degree may also arise by mutation. We identify these smaller muta- 
tional changes as the most probable variants that make a theory of evolution 
possible both because they do transcend the original types, and because they 
are inherited. 

T. H. Morgan (1925) Evolution and Genetics 


7.5.1 Neighboring-nucleotide effects on the rate of germline single
base-pair substitutions in human genes 


In terms of their relative frequency of occurrence, the most important category of 
single base-pair substitution causing human genetic disease is represented by C>T 
and G>A transitions within CpG dinucleotides; some 23% of all pathological
single base-pair substitutions found within the coding regions of human genes are of
this type (Krawczak et al., 1998). Allowing for the confounding effects of codon 
usage and differential clinical observation likelihoods through consideration of 
relative mutabilities, this proportion translates into a mean transition rate for 
either CG>TG or CG>CA that is five times higher than the base mutation rate. 

Krawczak et al. (1998) demonstrated that the proportion of pathological 
CG>(TG,CA) transitions is significantly higher for autosomal genes (25.0%) than
for X-linked genes (17.7%). These proportions are a direct reflection of the signif- 
icantly lower frequency of CpG in the coding sequences of X-linked genes (2.9%) 
as compared to autosomal genes (3.7%). Krawczak et al. (1998) speculated that the 
lower CpG frequency in X-chromosomal genes may be a consequence of a gener- 
ally increased level of DNA methylation resulting from the evolutionary recruit- 
ment of this post-synthetic modification to play a role in X-inactivation 
(Hornstra and Yang, 1994; Jamieson et al., 1996). 
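
The two quantities discussed above, the frequency of CpG in a coding sequence and the classification of a substitution as a CpG transition (CG>TG or CG>CA), can be made concrete with a brief Python sketch. The function names and the example sequences are invented for illustration and are not drawn from Krawczak et al. (1998); the frequency is computed here simply as the share of overlapping dinucleotides that are CG, which is one of several possible definitions.

```python
def cpg_fraction(seq):
    """Fraction of overlapping dinucleotides in seq that are CG."""
    seq = seq.upper()
    pairs = len(seq) - 1
    if pairs < 1:
        return 0.0
    return sum(seq[i:i + 2] == "CG" for i in range(pairs)) / pairs


def is_cpg_transition(seq, pos, new_base):
    """True if replacing seq[pos] by new_base is CG>TG or CG>CA."""
    seq = seq.upper()
    old, new_base = seq[pos], new_base.upper()
    if old == "C" and new_base == "T":
        # C>T counts only when the C is followed by G.
        return seq[pos + 1:pos + 2] == "G"
    if old == "G" and new_base == "A":
        # G>A counts only when the G is preceded by C.
        return pos > 0 and seq[pos - 1] == "C"
    return False


example = "ACGTCG"                      # invented sequence
fraction = cpg_fraction(example)        # dinucleotides: AC CG GT TC CG
```

A C>T change at the second base of ACGT, or a G>A change at the third, is classified as a CpG transition, whereas the same changes in ACAT or ATGT are not.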

For CpG dinucleotides to be hypermutable in the context either of genetic dis- 
ease or evolution, they must be methylated in the germline (El-Maarri et al., 
1998). Since it cannot be excluded that the efficiency of both DNA
methyltransferase action (Smith, 1994; Smith and Baker, 1997) and G:T mismatch repair
(Sibghat-Ullah and Day, 1993) may be influenced by sequence motifs flanking the 
CpG dinucleotide, the question arises as to whether some CpGs may be intrinsi- 
cally more mutable than others by virtue of their DNA sequence context. 
Significant differences in the relative mutation rate of CpG dinucleotides depend- 
ing upon their flanking nucleotides were indeed noted by Krawczak et al. (1998). 
These results were consistent with those of Ollila et al. (1996) who noted a prefer- 
ence for 5’ pyrimidines and 3’ purines flanking mutated CpG dinucleotides. 

Comparison of the sequences flanking the human and chimpanzee β-globin
genes has shown that the CpG dinucleotide is hypermutable over evolutionary 
time and subject to high frequency C>T or G>A transitions (Savatier et al., 
1985). Indeed, some 40% of the CpG dinucleotides present in either the human or 
chimpanzee sequences were found to be affected by nucleotide sequence changes. 
Similar conclusions have been drawn by Perrin-Pecontal et al. (1992). Comparison 
of the CpG mutation rates exhibited by globin gene and pseudogene sequences 
from human, chimpanzee and macaque yielded an estimate of the rate of 5mC 
deamination of ~1 × 10⁻¹⁶ (Cooper and Krawczak, 1989). The absence of any
significant difference between deamination rate estimates derived from gene and
pseudogene sequence data suggested that the action of selection has had a
negligible effect on the transition rate at CpG dinucleotides in primate β-globin genes.
Indeed, this constancy in the CpG deamination rate is consistent with a neutral- 
ist view of gene evolution. Moreover, the successful use of evolutionary compar- 
isons of DNA sequences to derive consistent values of the CpG deamination rate 
has demonstrated the feasibility of using the CpG deamination rate as a ‘molecu- 
lar clock’, at least over relatively short periods of evolutionary time. 

Although the nonrandomness of mutation is at its most dramatic for CpG
mutations, non-CpG mutations also appear to be distributed nonrandomly. A 
subtle neighboring nucleotide effect reminiscent of misalignment models of 
mutagenesis was noted by Krawczak et al. (1998) in their study of pathological 
mutations; a substantial proportion of the observed single base-pair substitutions 
exhibited identity of the newly introduced base to one of the bases immediately 
flanking the site of mutation. Since this effect occurred only at a distance of one 
base-pair and in the absence of surrounding repetitive sequences, it was poten- 
tially explicable in terms of misalignment mutagenesis involving highly localized 
DNA slippage, misincorporation and realignment events between template and 
primer at the replication fork (Kunkel, 1990, 1992). Intriguingly, this next- 
neighbor mutational bias only occurred at specific codon positions and exhibited 
polarity. Since such a phenomenon is unlikely to be explicable in terms of the 
primary mutational event, it is probably associated with the DNA repair process. 
Implicit in this assumption, however, is that the DNA repair machinery is able to 
recognize the reading frame and to utilize this information as a cue in effecting 
DNA repair. Consistent with such a relationship, the observed correction bias 
would operate in such a way as to remove newly introduced termination codons. 
The ability of a DNA repair mechanism to take the reading frame into account, 
thereby minimizing the effects of mutation, would have had positive selective 
value owing to the relatively deleterious nature of in-frame termination codons. 
Evidence for reading frame sensitivity in the DNA repair process first came from 
the observation that relative single base-pair substitution rates are biased toward
the avoidance of those replacements that (i) change the chemical characteristics 
of the encoded amino acid residue substantially and (ii) have a high likelihood of 
coming to clinical attention (Krawczak and Cooper, 1996a; Section 7.2). 
Krawczak et al. (1998) concluded that, by consideration of the genetic code, selec- 
tion has optimized the DNA repair mechanism in such a way as to avoid the most 
hazardous of amino acid replacements, a category which would certainly include 
nonsense mutations. 
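
The next-neighbor effect described above amounts to asking, for each substitution, whether the newly introduced base is identical to the base immediately 5’ or 3’ of the mutated site. The following Python sketch shows one way such a tally might be made; the function names and the four example substitutions are hypothetical, and no attempt is made to reproduce the correction for codon position or strand applied by Krawczak et al. (1998).

```python
def flank_identity(seq, pos, new_base):
    """Return the flanks ("5'", "3'") whose base equals the newly introduced base."""
    seq, new_base = seq.upper(), new_base.upper()
    if seq[pos] == new_base:
        raise ValueError("new base equals the original base: not a substitution")
    matches = []
    if pos > 0 and seq[pos - 1] == new_base:
        matches.append("5'")
    if pos + 1 < len(seq) and seq[pos + 1] == new_base:
        matches.append("3'")
    return matches


def neighbour_fraction(mutations):
    """Share of substitutions whose new base matches at least one flanking base."""
    hits = sum(1 for seq, pos, base in mutations if flank_identity(seq, pos, base))
    return hits / len(mutations)


# Invented substitutions given as (sequence, 0-based position, new base).
examples = [("GATC", 1, "G"), ("GATC", 1, "T"), ("GATC", 1, "C"), ("GATC", 2, "C")]
fraction = neighbour_fraction(examples)
```

In a real analysis the observed fraction would have to be compared with the fraction expected from the base composition of the flanking positions, since some identity to a neighbor arises by chance alone.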

Although the same substitution types appear to be subject to next-neighbor
effects on the coding and noncoding DNA strands, the quantitative differences in
non-CpG single base-pair substitution rates observed by Krawczak et al. (1998) 
confirmed that the two DNA strands are not fully equivalent in terms of their 
rates and patterns of mutation (Wu and Maeda, 1987). Nucleotide substitutions 
that have accumulated in noncoding sequences during evolution are also asym- 
metric between DNA strands (Francino and Ochman, 1997; Maeda et al., 1988). 
There are several possible (and nonmutually exclusive) reasons for this strand 
asymmetry. Firstly, the four nuclear DNA polymerases, each associated with its 
own distinctive mutational spectrum, may be differentially involved in the syn- 
thesis of the leading and lagging strands during DNA replication (Bambara et al., 
1997; Kunkel, 1992). Secondly, since the transcriptional elongation complex is 
asymmetric (Kainz and Roberts, 1992), mutation rates may differ between 
transcribed and non-transcribed strands on account of either unequal exposure 
to DNA damage or differential repair. Not only may the transiently single- 
stranded nontranscribed DNA strand be particularly vulnerable to mutation 
(e.g. by methylation-mediated deamination; Beletskii and Bhagwat, 1996) but 
transcription-coupled repair (Bhatia et al., 1996; Drapkin et al., 1994; Hanawalt,
1994), a process that corrects lesions specifically on the transcribed DNA strand, 
could also account for different mutation rates between transcribed and nontran- 
scribed strands. Both these mechanisms would predict a higher mutation rate for 
the nontranscribed as opposed to the transcribed DNA strand. 

Finally, in their dataset of pathological substitutions, Krawczak et al. (1998) 
found that thermodynamic stability of DNA triplets was positively correlated 
with the average relative rate at which the central nucleotide of a triplet under- 
went substitution. At first sight, this finding might appear to be counterintuitive 
since it implies that higher rather than lower DNA duplex stability would render 
a gene region more prone to single base-pair substitution. However, the consistent 
absence of flanking repeat elements noted for the mutations analyzed here sug- 
gests that extensive strand slippage (which would require the DNA to be single- 
stranded) is unlikely to play an important role in the generation of single 
base-pair substitutions. Nevertheless, a high degree of thermodynamic stability 
could in principle impair DNA replication in various ways. First, the likelihood 
that DNA helicases would be incapable of unwinding the two DNA strands
correctly or efficiently may be expected to be higher in more stable regions (Chen et
al., 1992). Second, temporary reannealing of the two native DNA strands during
replication might be favored and could be more enduring in such regions. In both 
cases, DNA polymerase activity would be seriously impeded by localized double- 
stranded DNA structures, which could result in either a cessation of polymeriza- 
tion or the skipping of one or more nucleotides, leaving a gap in the nascent DNA 
strand. Miscorrection during the post-replicative repair of such nicks would then 
introduce a single base-pair substitution. Alternatively, the observed correlation 
could reflect the increased stability of at least some slippage-mediated misalign- 
ments during replication of the native and nascent DNA strands, allowing 
enough time for misincorporation of a noncomplementary nucleotide. In this 
case, however, the thermodynamic stabilities of the misaligned structures must be 
comparable to those of the wild-type triplets, an assumption for which there is 
currently no evidence. 


7.5.2 Single base-pair substitutions in evolution which have altered the 
function of specific amino acid residues 


Natural selection is a mechanism for generating an exceedingly high degree of 
improbability. 
R.A. Fisher 


Some single base-pair substitutions occurring during gene evolution have intro- 
duced nonsense mutations into protein coding regions thereby either prema- 
turely truncating the protein product (Chapter 8, section 8.7) or abolishing the 
expression of that product altogether (Chapter 6, section 6.2). However, other 
single base-pair substitutions have introduced missense mutations that have 
served to alter the function of specific amino acid residues and these are the topic 
of this Section. 

Nucleotide substitutions that have occurred and become fixed through 
the action of genetic drift are readily apparent in any comparative analysis of 
orthologous genes and provide abundant evidence for a neutralist model of
evolution. Nucleotide substitutions resulting in human genetic disease (Cooper and
Krawczak, 1993) may be taken as representative examples of negative (purifying) 
selection since many of the lesions that have come to clinical attention are likely 
to reduce (or would once have reduced) survival and/or reproductive fitness. By 
contrast, unequivocal examples of the positive Darwinian selection of 
nucleotide/amino acid substitutions in higher eukaryotes are much rarer. Indeed, 
there are relatively few examples of single base-pair substitutions that have 
occurred within the coding regions of mammalian genes during evolution which 
have been sufficiently well characterized for us to be able to identify the conse- 
quent change in protein structure and function that has been subject to positive 
selection. Several illustrative examples are discussed below. Since these systems 
have not so far been fully characterized, further studies are eagerly awaited. 


The visual pigments. The visual pigments, which comprise an integral membrane
protein (opsin) coupled to a light-sensitive chromophore, are spectrally
tuned to a particular wavelength of maximal absorption, λmax. The visual
pigments in rods (photoreceptor cells that function in dim light) are rhodopsins
which have a λmax of 495 nm. In cones, the photoreceptor cells that mediate
colour vision, there are three types of visual pigment which in humans have λmax
values of 420 nm (blue/short wavelength-sensitive), 530 nm (green/middle
wavelength-sensitive) and 560 nm (red/long wavelength-sensitive). The human genes
encoding these opsins are: rhodopsin (RHO; 3q21-q24), the blue cone pigment
(BCP; 7q31-q35), the green cone pigment (GCP; Xq28) and red cone pigment
(RCP; Xq28). These genes have evolved by a process of duplication and
divergence from a common ancestor ~500 Myrs ago (Yokoyama, 1997) and encode
proteins that harbour specific amino acid changes that are directly responsible for the
shifts in λmax values between the different visual pigments.

To understand the molecular basis of spectral tuning of visual pigments, it is
necessary to correlate the sequences of the visual pigments with their λmax
values. Such an analysis was first performed by Yokoyama and Yokoyama (1990) who
compared the red and green visual pigments from human and the Mexican
cavefish, Astyanax fasciatus. The red pigments in humans and cavefish were found to
have evolved from the green pigments independently by three specific amino acid
changes (Ala180Ser, Phe277Tyr and Ala285Thr, i.e. AFA → SYT; Figure 7.2). A
similar study in primates (Neitz et al., 1991) also concluded that the spectral
difference between red and green pigments could be accounted for by the difference
between AFA (green) and SYT (red). These three critical amino acid residues are
located near the chromophore and their experimental substitution has been
shown to alter λmax values not only in human red and green pigments (Asenjo et
al., 1994) but also in bovine rhodopsin (Chan et al., 1992).

Most mammals have dichromatic vision, possessing blue visual pigments
together with either red or green pigments (Jacobs, 1993). Phylogenetic analysis
of a range of vertebrates has suggested that the vertebrate ancestor of the opsin
gene was a green pigment gene encoding AFA at the three critical sites and that
the common ancestor of tetrapods acquired a red pigment by two amino acid
substitutions (Phe277Tyr and Ala285Thr; Figure 7.2). The SYT of the red pigments
of extant tetrapods was acquired by a further amino acid substitution, Ala180Ser
(Figure 7.2).

Figure 7.2. Phylogenetic relationships between vertebrate opsins (after Yokoyama, 1997)
showing amino acids inferred at sites 180, 277, and 285. The bold and outlined letters
refer to red pigment and green pigment characteristic amino acids respectively. Probable
amino acid substitutions at different evolutionary stages are boxed.

Old World monkeys have both red and green pigments and therefore possess 
trichromatic vision. Phylogenetic analysis of the primate opsins (Nei et al., 1997) 
has indicated that the common ancestor of hominoids, Old World monkeys and 
New World monkeys possessed a red visual pigment with AYT at the three criti- 
cal sites (Figure 7.2) but no green pigment. The ancestor of the green pigment 
gene is thought to have arisen by gene duplication ~35 Myrs ago in the Old 
World monkey lineage (Nathans et al., 1986) and the green pigment was then 
derived in two distinct steps (AFA → AYT → AFA; Figure 7.2).

New World monkeys have only one X-linked opsin gene but exhibit multiple 
alleles at this locus (Jacobs et al., 1996; Neitz et al., 1991). Thus, in the squirrel 
monkey (Saimiri sciureus), the visual pigments derived from three alternative alle- 
les have Amax values of 532, 547, and 561 nm and manifest amino acids AFA, 
AFT, and SYT respectively at the critical sites (Neitz et al., 1991). Whilst all male 
and homozygous female squirrel monkeys are dichromatic, heterozygous females 
that happen to have pigments with Amax values of 532 nm and 561 nm possess 
trichromatic vision. Tovee (1994) suggested that color vision in New World mon- 
keys might be an adaptation to allow a wide variety of color vision types within a 
single family group. However, an alternative explanation, which does not invoke 
group selection, would be for the alternative alleles to be maintained in the popu- 
lation by overdominant selection. This postulate appears to be supported by the 
independent evolution of triallelic systems in several other New World monkey 
lineages represented by the marmoset (Callithrix jacchus), the saki monkey 
(Pithecia irrorata), the capuchin (Cebus nigrivittatus) and the tamarin (Saguinus
mystax) (Boissinot et al., 1998; Shyue et al., 1995).

Zhou et al. (1997) studied the X-linked opsin gene of two nocturnal prosimians, 
the bushbaby species Galago senegalensis and Otolemur garnettii. At those amino
acid positions known to cause spectral differences, however, the cone pigment 
possessed identical residues to those of the marmoset protein. This suggests that, 
in spite of the bushbaby’s nocturnal existence, its X-linked opsin gene is under 
functional constraint. Consistent with this view, Zhou et al. (1997) noted two 
amino acid substitutions which may be important for maximizing dim light sen- 
sitivity. 


The kringle domains of apolipoprotein(a). The human apolipoprotein(a) 
(LPA; 6q27) gene emerged from the plasminogen (PLG; 6q27) gene by a process 
of gene duplication followed by intragenic amplification of exons within the LPA 
gene (Chapter 8, section 8.6). Apolipoprotein(a) is a component of the cholesteryl 
ester-rich particle lipoprotein(a) within which it is covalently linked to 
apolipoprotein B100. Lipoprotein(a) has been postulated to play a role in fibri- 
nolysis by competing with plasminogen for binding to fibrin, thereby interfering 
with clot lysis. Fibrin binding appears to be potentiated by lysine-binding sites in 
some of the many kringle domains of apolipoprotein(a). Kringle IV-10 of human 
apolipoprotein(a) most closely resembles that of plasminogen kringle IV and 
appears to be critical for the fibrin-binding potential of lipoprotein(a). The lysine- 
binding sites of apolipoprotein(a) consist of a hydrophobic trough containing 
three aromatic amino acids (Trp62, Phe64, and Trp72), an anionic centre com- 
posed of two aspartic acid residues (Asp55 and Asp57), and a cationic centre com- 
prising two residues, Lys35 and Arg71. Kringles IV-1, IV-2, IV-3, and IV-4 of 
apolipoprotein(a) do not bind to lysine and in each case, this is associated with the 
absence of Asp57 within the kringle. Two different substitutions in kringle IV-10 
have occurred during primate evolution which are associated with the loss of the 
lysine-binding properties of the lipoprotein(a) particle: a Trp72→Arg
substitution in rhesus monkey (Scanu et al., 1993; Tomlinson et al., 1989) and an
Asp57→Asn substitution in chimpanzee (Chenivesse et al., 1998). The
physiological consequences of these two substitutions are, however, as yet unknown.


The DNA-binding specificity of steroid receptors. Gene regulation by steroid 
hormones is mediated by binding of the hormone ligand to its cognate receptor 
(Chapter 4, section 4.2.3, Nuclear receptor genes). Upon ligand binding, most nuclear 
receptors then interact as dimers with their response elements, each monomer 
binding to a half-site sequence. These response elements comprise two 6 bp half- 
site sequences organized as palindromic repeats with 3-5 bp separating the half 
sites. Thus the thyroid hormone, retinoic acid and estrogen receptors recognize 
the half-site sequence TGACCT whereas the androgen, progesterone and gluco- 
corticoid receptors recognize the half-site sequence TGTTCT. More specifically, 
the consensus estrogen response element (ERE) is AGGTCANNNTGACCT
whilst the consensus glucocorticoid response element (GRE) is
GGTACANNNTGTTCT (Zilliacus et al., 1995). These response elements exhibit a 3 bp spacer
between half-sites. By contrast, the thyroid hormone receptor recognizes a 4 bp 
spacer and the retinoic acid receptor a 5 bp spacer. Thus, both the sequence and the
spacing between the half-sites determine the specificity.
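
The way in which half-site sequence and spacing jointly specify a response element can be sketched in a few lines of Python. The fragment below splits an element into two 6 bp half-sites and a spacer, tests whether the half-sites form a perfect palindrome, and looks up a candidate receptor from the combinations mentioned in the text. The lookup table and function names are illustrative assumptions; real receptor binding tolerates departures from the consensus that this sketch ignores.

```python
COMPLEMENT = str.maketrans("ACGTN", "TGCAN")

# (downstream half-site, spacer length) combinations taken from the text.
RECEPTORS = {
    ("TGACCT", 3): "estrogen receptor",
    ("TGACCT", 4): "thyroid hormone receptor",
    ("TGACCT", 5): "retinoic acid receptor",
    ("TGTTCT", 3): "glucocorticoid receptor (half-site shared with the "
                   "androgen and progesterone receptors)",
}


def reverse_complement(seq):
    """Reverse complement of a DNA sequence (N is left as N)."""
    return seq.translate(COMPLEMENT)[::-1]


def describe_element(element):
    """Split an element into 6 bp half-sites and spacer and classify it."""
    element = element.upper()
    if len(element) < 12:
        raise ValueError("element must contain two 6 bp half-sites")
    left, spacer, right = element[:6], element[6:-6], element[-6:]
    return {
        "left_half_site": left,
        "right_half_site": right,
        "spacer_bp": len(spacer),
        "perfect_palindrome": reverse_complement(right) == left,
        "candidate_receptor": RECEPTORS.get((right, len(spacer)), "unassigned"),
    }


ere = describe_element("AGGTCANNNTGACCT")   # consensus ERE from the text
gre = describe_element("GGTACANNNTGTTCT")   # consensus GRE from the text
```

Applied to the two consensus sequences, the sketch reports a 3 bp spacer for both; the ERE half-sites are exact reverse complements of one another, whereas the upstream half-site of the consensus GRE is an imperfect copy.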

Most nuclear receptors have two or three amino acids (P box residues) in their 
DNA-binding domains that serve to specify recognition of the response element 
within the promoter of the target gene. Some residues (e.g. Val443 in the gluco- 
corticoid receptor or Glu439 in the estrogen receptor) can contribute to specificity 
both by forming a positive interaction with a base in the cognate response element 
and by forming a negative interaction with a base in a noncognate response ele- 
ment (Zilliacus et al., 1994, 1995). In the glucocorticoid receptor, Ser440 inhibits 
the interaction of the receptor with the ERE but at the cost of also reducing affin- 
ity for the GRE (Zilliacus et al., 1994, 1995). Thus, the diversification of steroid 
receptor specificity was probably achieved during evolution by a relatively small 
number of single base-pair substitutions either in the P-box encoding residues of 
steroid receptors or in the response elements of their target genes. 


The calcium-dependent and -independent synaptotagmins. Synaptotagmins 
constitute a large family of proteins involved in membrane trafficking. At least 
five different synaptotagmin genes have been characterized in the human genome 
(SYT1, 12cen-q21; SYT2, 1q; SYT3, 19q; SYT4, 5q; SYT5, 11p). Most
synaptotagmins are capable of binding calcium through their calcium-binding C2A
domains. Synaptotagmins IV and XI are unique in their inability to bind calcium
and Von Poser et al. (1997) have shown that this inability is caused by the
substitution of Ser for Asp at residue 230 in the C2A domain. This substitution is
evolutionarily conserved in synaptotagmin IV and it seems likely that
synaptotagmins IV and XI shared a common ancestor in which the substitution 
originated. Von Poser et al. (1997) postulated that evolution selected for the loss of 
calcium binding in two different synaptotagmins while leaving the remainder of 
the protein structures intact. This would of course imply that these synaptotag- 
mins also possess calcium-independent properties and that these properties have 
been retained by selection through evolutionary time. 


Olfactory receptor-ligand interactions. Certain residues in the α-helical sixth
transmembrane domain of human olfactory receptors have been implicated in 
interactions with their odorant ligands and these residues appear to have been 
subject to positive selection (Singer et al., 1996). The highly variable residues 6-22 
and 6-25 are thought to constitute a receptor sub-site that binds hydroxyl groups 
on odorant ligands; substitutions at these critical positions may therefore help to 
determine the odor specificity of olfactory receptor sub-types. A Val/Ile difference
at residue 206 of the orthologous rat and mouse I7 olfactory receptor proteins has 
been shown to be responsible for a species-specific odorant response; rats prefer- 
ing octanal to heptanal, mice the reverse (Krautwurst et al., 1998). 


7.5.3 Single base-pair substitutions in evolution which have affected 
mRNA splicing


Single base-pair substitutions affecting mRNA splicing make up between 8% and 
15% of all mutations causing human genetic disease (Krawczak et al., 1992). These 
lesions occur disproportionately at the most evolutionarily conserved positions
within the splice site and fall into three main categories: (i) mutations within 5’
or 3’ splice sites which reduce the amount of correctly processed mature RNA 
and/or activate alternative (‘cryptic’) splice sites in the vicinity, (ii) mutations out- 
with actual splice sites which create cryptic splice sites, and (iii) mutations in the 
branch-point sequence (Krawczak et al., 1992). The vast majority of the patholog- 
ical lesions affecting mRNA splicing so far reported have been single base-pair 
substitutions within splice sites. This is not only because these are comparatively 
frequent but also because they are both readily detectable and highly likely to 
result in a severe clinical phenotype. Disease-associated mutations affecting 5’ 
splice sites are approximately twice as frequent as mutations at 3’ splice sites 
(Krawczak et al., 1992). This discrepancy coincides with a much higher level of 
sequence conservation at 5’ splice sites and is likely to reflect the strong
requirement for U1 snRNA binding at 5’ splice sites to promote alignment and cleavage.
Regarding the phenotypic consequences of pathological mutations affecting 
mRNA splicing, the exclusion of one or more exons from the end-product (exon 
skipping) is observed more frequently than cryptic splice site utilization (Krawczak et 
al., 1992). Some evidence exists, at least for 5’ splice sites, that cryptic splice site 
usage is favored under conditions where a number of potential sites are present in 
the vicinity of the mutated splice site and where these potential splice sites exhibit 
sufficient homology to the consensus sequence (Krawczak et al., 1992). In most 
such cases, the activating mutation improves the similarity between the cryptic 
site and the splice site consensus sequence. At 3’ splice sites, the amount of 
mRNA product consequent to the utilization of the cryptic splice site appears to 
be correlated with the level of similarity to the consensus sequences; at 5’ splice 
sites, the distance to the nearest wild-type splice site may also play a role 
(Krawczak et al., 1992). 
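
The notion of similarity to the splice site consensus can be illustrated by a deliberately simple Python sketch. It counts the positions at which a nine-nucleotide candidate 5’ splice site agrees with the widely quoted donor consensus MAG|GTRAGT (M = A or C, R = A or G) and checks whether the invariant GT is intact. This unweighted match count is an assumption made for brevity and is not the consensus measure employed by Krawczak et al. (1992); the function names and test sequences are likewise hypothetical.

```python
# IUPAC symbols needed for the donor consensus.
IUPAC = {"A": "A", "C": "C", "G": "G", "T": "T", "M": "AC", "R": "AG"}

# Last three exonic bases followed by the first six intronic bases.
DONOR_CONSENSUS = "MAGGTRAGT"


def consensus_matches(site, consensus=DONOR_CONSENSUS):
    """Number of positions at which site agrees with the consensus."""
    site = site.upper()
    if len(site) != len(consensus):
        raise ValueError("site and consensus must have the same length")
    return sum(base in IUPAC[symbol] for base, symbol in zip(site, consensus))


def has_invariant_gt(site):
    """True if the first two intronic bases (positions +1 and +2) are GT."""
    return site.upper()[3:5] == "GT"


wild_type = "CAGGTAAGT"     # invented site matching the consensus throughout
mutant = "CAGATAAGT"        # same site with G>A at intron position +1
cryptic = "TAGGTCTGA"       # invented nearby candidate site

scores = {name: consensus_matches(seq)
          for name, seq in [("wild_type", wild_type),
                            ("mutant", mutant),
                            ("cryptic", cryptic)]}
```

The example shows why a simple count is insufficient on its own: the mutant site still matches the consensus at eight of nine positions, yet it has lost the invariant GT, whereas the weaker cryptic candidate retains it. A weighted score would capture this difference directly.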


Splicing mutations and gene inactivation. The alteration of mRNA processing 
as a result of mutations in splice sites also occurs during evolution. One dramatic 
example is that of the once active human L-gulono-γ-lactone oxidase (GULOP)
gene (chromosome 8p21.1) which was inactivated at least in part by the
introduction of single base-pair substitutions at the invariant bases of splice sites
(Nishikimi et al., 1994; see Chapter 6, section 6.2.4). Another similar example is
provided by the CSHL1 gene, long regarded as a pseudogene of the human
placentally expressed growth hormone/chorionic somatomammotropin gene family
which is clustered on chromosome 17q22-q24. The CSHL1 ‘pseudogene’ was
originally thought to have been inactivated by the introduction of a G>A
substitution in the first position of the intron 2 donor splice site (Press et al., 1994).
However, in vitro expression studies have shown that it may only have been
partially inactivated (Misra-Press et al., 1994). Although five alternatively spliced
forms of CSHL1 mRNA are produced from the ‘pseudogene,’ the majority of
these transcripts lack exon 2 which encodes the signal peptide necessary for
normal secretion. Some low abundance CSHL1 mRNAs are nevertheless produced
which possess the leader peptide and these could in principle encode novel hor- 
mones of physiological significance. 

Splice site differences between orthologous genes. In an evolutionary
context, gene inactivation is likely to be a relatively rare occurrence and alter- 
ations in splice junction sequences are more often to be found associated with 
more subtle changes in mRNA processing such as alternative splicing. A number 
of examples have been noted in orthologous genes (Chapter 3, section 3.2). Thus, 
in the erythroid 5-aminolevulinate synthase (ALAS1; 3p21.1) gene, exon 4 is
involved in alternative splicing in the human but not in the dog or mouse
(Conboy et al., 1992). Conversely, the alternative splicing of the 45 bp exon 3 of the
murine Alas1 gene utilizes a major upstream splice site (85% of mRNAs) and a
minor downstream site (15% of mRNAs). This is not found in human owing to an
A>G transition which abolishes the consensus sequence of the 3’ splice site
thereby preventing the possibility of alternative splicing (Conboy et al., 1992).
Finally, the alternative splicing pathway involving the mutually exclusive exons
6A and 6B of the β-tropomyosin (TPM2; 9q13) gene differs between chicken, rat
and Xenopus (Pret and Fiszman, 1996). The chicken β-tropomyosin exon 6A is
flanked by stronger splicing signals than its rat counterpart and this has been 
related to inter-specific differences in both the donor and acceptor splice sites 
(Pret and Fiszman, 1996). A chicken-specific pyrimidine-rich splicing enhancer 
present upstream of exon 6A may also play a role. 


Splice site differences between paralogous genes. Evidence for splice site 
mutations having occurred during evolution has also come from the study of par- 
alogous genes. The human glycophorin B (GYPB; 4q28.2-q31) gene lacks exon 3 
by comparison with the related glycophorin A (GYPA; 4q28.2-q31) gene owing to 
a G>T transversion at the exon 3 donor splice site (Kudo and Fukuda, 1989). 
Comparison with the GYPB gene has identified an A>T substitution in the first
base of exon 5 of the GYPA gene which leads to the alternative use of an acceptor 
splice site 9 bp upstream and the incorporation of nine extra bases into the GYPA 
coding sequence (Kudo and Fukuda, 1989). 

The human genes encoding interleukin-1α (IL1A; 2q13-q21), interleukin-1β
(IL1B; 2q13-q21) and interleukin-1 receptor antagonist (IL1RN; 2q14.2) are evolutionarily
related members of the interleukin family which remain closely linked
on the long arm of chromosome 2. The first exon of IL1RN (encoding a leader
peptide) is homologous to the untranslated first exon of IL1B but the IL1RN gene
lacks the exons corresponding to the first three expressed exons of the IL1A and
IL1B genes. Hughes (1994) suggested that the common ancestor of the IL1B and
IL1RN genes was an alternatively spliced gene: one transcript could have
included exons 1-7 encoding the ancestral IL1B protein whereas the other transcript
may have included exons 1 and 5-7 encoding the IL1RN protein. The
duplication of this ancestral gene would have freed one copy from functional constraints
so that it could encode IL1RN only. Selection would no longer have been
able to conserve the intron-exon junctions involved in the splicing of exons 2-4 of
the IL1RN gene. Consistent with this view of events, the region of the IL1B gene
between exons 1 and 5 is more than twice the length of the analogous region of the
IL1RN gene.

Finally, the pituitary-expressed growth hormone (GH1; 17q22-q24) gene differs
from the closely linked placentally-expressed growth hormone (GH2) gene in its
pattern of splice site selection. Whereas 9% of GH1 mRNA transcripts contain
exon 2 spliced to an alternative acceptor site located 45 bp into exon 3, the GH2
gene does not utilize this alternative splicing pathway. Three single base substitutions
located between the two alternative acceptor splice sites have been shown to
be both necessary and sufficient to define the GH1 alternative splicing event
(Estes et al., 1990). One of these changes is thought to specify a lariat branchpoint
essential for alternative acceptor site usage whereas the other two bases may serve
to modulate the frequency with which the site is selected.


Contractions and expansions in gene size and number


There are now many pathological examples of the deletion, insertion, duplication 
and expansion of human genes causing inherited disease. Similar mutations have 
however also occurred over evolutionary time. Far from being invariably disad- 
vantageous, such mutational changes have often been recruited by the oppor- 
tunistic evolutionary process and now contribute to both gene and genome 
architecture. These types of mutation have led to significant changes in gene size 
and number in different lineages and their contribution to the evolution of extant 
human genes will now be reviewed. 


8.1 Gross gene deletions in evolution 


8.1.1 Gross gene deletions during primate evolution 


Gross gene deletions may arise through a number of different recombinational
mechanisms, of which the most common is probably homologous unequal
recombination (occurring either between related gene sequences or between repetitive
elements). Thus, Alu sequences flanking deletion breakpoints have been
noted in a considerable number of human genetic conditions and may represent 
favored sites for recombination and hotspots for gene deletions (Cooper and 
Krawczak, 1993; Chapter 1, section 1.5.4). Chromosomally duplicated regions 
(duplicons) are often common sites for pathological rearrangements, particularly 
gross deletions, since they have the potential to mediate homologous unequal 
recombination events (e.g. 15q11-q13; Christian et al., 1999).

Not surprisingly, several instances of gross gene deletion have been noted during
primate evolution. One such example is the loss of the γ1-globin gene in New
World monkeys (with the notable exception of the capuchin monkey, Cebus albifrons)
due to a 1.8 kb deletion which has removed most of exon 2, all of intron 2,
exon 3, and much of the 3’ flanking region (Meireles et al., 1995). As a result, γ2-globin
is the primary fetally expressed globin gene in New World monkeys
whereas in Old World monkeys, it is γ1-globin.

Another example of gross gene deletion in primates is the loss of one of the hap- 
toglobin (HPP; 16q22.2) genes in humans which occurred, probably via homolo- 
gous unequal recombination, after the separation of the human and chimpanzee 
lineages (McEvoy and Maeda, 1988; see Chapter 6, section 6.2.6). Finally, although
there are two semenogelin genes (SEMG1, SEMG2; 20q12-q13.1) in humans and 
most Old World and New World monkeys, the Semg2 gene has been deleted from 
the genome of the cotton-top tamarin (Saguinus oedipus) (Lundwall, 1998). There is 
no evidence for any selective advantage resulting from any of these gene deletions. 
In all cases cited, a similar paralogous gene was available to substitute for the 
deleted locus. This genetic redundancy probably ensured that, owing to the absence 
of purifying selection, such deletions came to be fixed through genetic drift. 


8.1.2 Gross deletional polymorphisms 


Gross gene deletions in two distinct glutathione S-transferase genes have been 
found to occur as polymorphic variants in various human populations. In humans, 
four gene families of glutathione S-transferases encode a series of enzymes respon- 
sible for the metabolism of a wide range of xenobiotics (DeJong et al., 1991; Pearson 
et al., 1993). The gene for the mu-class glutathione S-transferase (GSTM1; 1p13.3) 
is absent from between 10% and 64% of individuals depending upon the population 
under study (Board et al., 1990; Nelson et al., 1995; Seidegard et al., 1988). Loss of 
the GSTM1 gene is due to a 15 kb deletion probably brought about by homologous 
unequal recombination between two almost identical 4.2 kb repeats that flank the 
GSTM1 gene (Figure 8.1; Xu et al., 1998). These repeats are likely to have originated 
with the original duplication that gave rise to the mu-class glutathione S-transferase 
genes more than 20 Myrs ago (Xu et al., 1998). A similar deletional polymorphism 
also occurs in the theta-class glutathione S-transferase (GSTT1; 22q11.2) gene; the 
gene is absent in ~38% of the Caucasian population (Brockmoller et al., 1992; 
Pemble et al., 1994). Since the GSTM1 and GSTT1 genes are involved in xenobiotic
metabolism, it is quite possible that the presence/absence of these polymorphic
alleles is not selectively neutral (Weber, 1997).

Dichromacy (color blindness) is very common (occurring at a polymorphic frequency
of 5-8% in Caucasian males) with individuals so affected having either a deletion
of the green color pigment (GCP) gene or possessing a hybrid red/green color
pigment (RCP/GCP) gene in its place (Deeb et al., 1995). Further variable gene
number polymorphisms, which probably also arose by homologous unequal recombination,
are apparent in the human α-amylase gene cluster at chromosome 1p21
(Groot et al., 1989) and the pepsinogen A gene cluster at 11q13 (Taggart et al., 1987).


[Figure 8.1: schematic of the GSTM4, GSTM2, GSTM1 and GSTM5 genes before and after deletion of GSTM1, with EcoRI/HindIII fragment sizes]


Figure 8.1. Model for homologous recombination between 4.2 kb repeats (open boxes) 
flanking the human mu-class glutathione S-transferase (GSTM1) gene leading to gene 
deletion (Xu et al., 1998). The GSTM1, GSTM2, GSTM4 and GSTM5 genes are
represented by solid bars. Vertical lines denote EcoRI and HindIII restriction sites. The 
sizes of the EcoRI/HindIII fragments are shown. 

In the first case, a ‘short’ haplotype lacks AMY1A, AMY1B and the pseudogene
AMYP1 whilst a ‘long’ haplotype contains two extra copies of a duplicated fragment
containing AMY1A, AMY1B, and AMYP1. In the second case, the three most common
haplotypes were PGA-A (containing the PGA3, PGA4, and PGA5 genes),
PGA-B (containing the PGA3 and PGA4 genes) and PGA-C (containing only the
PGA4 gene). Copy number polymorphisms due to gene deletions have also been
reported at the human T-cell receptor β (TCRB; 7q35) and γ (TCRG; 7p15) loci
(Ghanem et al., 1989; Rowen et al., 1996), the ζ-globin (HBZ; 16p13.3) gene (Felice
et al., 1986), the rhesus blood group D antigen (RHD; 1p34-36.2) gene (Colin et al.,
1991), the immunoglobulin heavy chain constant region γ4 (IGHG4; 14q32) gene
(Rabbani et al., 1996) and the complement C4A (C4A; 6p21.3) and C4B (C4B;
6p21.3) genes (Teisberg et al., 1988). In some gene clusters, it can be difficult to ascertain
whether polymorphic alleles have arisen by gene deletion or duplication; some
examples of human duplicational polymorphisms are given in Section 8.5.


8.2 Microdeletions in evolution 


Is it possible to extrapolate from lessons learned through the study of microdele- 
tions in a pathological context to microdeletions that have occurred during gene 
evolution? In particular, can we gain insight into the nature of the generative 
mechanism(s) underlying evolutionarily significant microdeletions and the pos- 
sible influence of the local DNA sequence environment? Although in principle 
the answer to this question is likely to be in the affirmative, in practice the DNA 
sequences that were originally responsible for mediating the microdeletions have 
often decayed or been lost and it may not always be possible to reconstruct them. 


8.2.1 Microdeletions in pathology 


Microdeletions (<20 bp) causing human genetic disease were analyzed by Cooper 
and Krawczak (1993) in an attempt to relate the presence of specific DNA 
sequence motifs in the vicinity of these lesions to possible mechanisms responsi- 
ble for their generation. In many cases, slipped mispairing at the replication fork 
between homologous sequences in close proximity to one another on comple- 
mentary DNA strands appeared to be the causative mechanism. Slipped mispair- 
ing probably occurred either between direct repeats or through the formation of 
secondary structure intermediates potentiated by the presence of inverted repeats 
or symmetric elements (Cooper and Krawczak, 1993; Krawczak and Cooper, 1991). 
A consensus sequence, TGRRKM, common to pathological deletion hotspots has 
been noted in a number of different human genes (Krawczak and Cooper, 1991). 
This deletion hotspot consensus sequence is similar to the core motifs, TGGGG and
TGAGC, found in immunoglobulin switch (Sμ) regions (Gritzmacher, 1989) and
to putative arrest sites for DNA polymerase α (Weaver and DePamphilis, 1982).
Cooper and Krawczak (1993) also found that a second motif (polypyrimidine runs
of at least 5 bp; YYYYY) was over-represented in the vicinity of short human
gene deletions whilst Monnat et al. (1992) observed a significant association
between HPRT1 (Xq26.1) gene deletion breakpoints and CTY vertebrate topoisomerase
I cleavage sites. In principle, such sequence motifs may also have promoted
the occurrence of microdeletions during evolution. Indeed, in probably the
largest study of its kind to date, a sequence comparison of orthologous and paralogous
members of the primate T-cell receptor β (TCRB; 7q35) gene family, microdeletion
breakpoints appear to be frequently flanked by polypyrimidine runs and
sequences which possess marked homology to the deletion hotspot consensus
sequence (Funkhouser et al., 1997).
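
The motifs discussed above are written in IUPAC ambiguity codes (R = A or G, K = G or T, M = A or C, Y = C or T), so a search for them can be sketched as a simple pattern scan. The following is only a minimal illustration, not the procedure used in the cited studies: the function names and the example sequence are hypothetical, and exact matching is assumed where the original analyses may have allowed mismatches.

```python
import re

# IUPAC nucleotide ambiguity codes used by the motifs quoted in the text.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]",  # purine
    "Y": "[CT]",  # pyrimidine
    "K": "[GT]",
    "M": "[AC]",
}


def find_motif(seq, motif):
    """Return 0-based start positions of every (possibly overlapping) match."""
    body = "".join(IUPAC[code] for code in motif)
    # A lookahead lets overlapping occurrences be reported as well.
    return [m.start() for m in re.finditer("(?=" + body + ")", seq.upper())]


def polypyrimidine_runs(seq, min_len=5):
    """Return (start, end) half-open intervals of runs of >= min_len pyrimidines."""
    pattern = "[CT]{%d,}" % min_len
    return [(m.start(), m.end()) for m in re.finditer(pattern, seq.upper())]


# Hypothetical sequence, invented for illustration only.
example = "AATGGGGCACCTTTCTTGAGCAA"
hotspots = find_motif(example, "TGRRKM")
runs = polypyrimidine_runs(example)
print(hotspots, runs)
```

In this invented sequence the scan reports one match to TGRRKM (TGGGGC) and one polypyrimidine run. The switch-region core TGAGC near the 3’ end is not reported, since its fifth base (C) falls outside K; this is consistent with the text describing the core motifs as similar to, rather than instances of, the consensus.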


8.2.2 Microdeletions mediated by direct repeats 


Microdeletions occur during gene evolution at a frequency 10% that of
nucleotide substitutions (Saitou and Ueda, 1994). Pairwise comparisons of the
noncoding regions of human, rabbit and murine β-globin genes have shown that
they differ from each other in terms of numerous deletions/insertions and
Efstratiadis et al. (1980) proposed that the short direct repeat (2-8 bp) sequences
immediately flanking these sites could have templated the generation of these
lesions by slipped mispairing (Figure 8.2). Direct repeats may also have been
involved in generating the two inactivating single base-pair deletions (del C822
and del G904) noted in the human α-1,3 galactosyltransferase (GGTA1; 9q34)
gene (Larsen et al., 1990; see Chapter 6, section 6.2.2). Since chimpanzees possess
both these deletions, whereas orangutan and gorilla only have del904, it would


5' FLANKING 

Human 6 CCA------- GCATAAAA 

Human B CCAGGGCTGGGCATAAAA 

LARGE INTRON 

Human Ay ATGACTTTT----ATTAGAT 

Human Ay ATGACTTTTCTTTATTAGAT 

Human Ay TGTGTGIGTGTGTG-------------------- TGTGTGTGTGTG 

Human Gy TIGIGIGIGIGIGIGIGTGITGTGCGCGCGTGTGTTIGIGIGTGTGTG 

Rabbit g AAGTA----------------- CTTTCTCTAATC 

Mouse ß AGTCCTTCTCTCTCTCCTCTCTCTTTCTCTAATC 

Rabbit ß TGGTAG-------------------------------- AARACAACT 

Mouse B TIGGCTITIATGCCAGGGTGACAGGGGAAGAATATATITTACATAT 

Mouse B™ TGACATAGG-------------------------------------------- ATTCT 
Mouse B™ TGTCATAGAATAATTCTTTTTTATTTTTTATTTATTATTTTTTTCATAGAATAATTCT 
Mouse p™ TGTGTGTGG------------ AGTGTT 

Mouse pgmn GGTGIGITGGATGTGAATTGTGAGTGTT 

3' NONCODING 

Human e€ CAGGT----------- GTTCCT 

Human B CCAATTICTATTAAAGGTTICCT 

Human Ay GCAATACAAA---------------------------------------------- TAATAAAAT 
Human B GTCCAACTACTAAACTGGGGGATAT TATGAAGGGCCT TGAGCATCTGGATTCTGCCTAATAAAAA 
Human Ay ATACAA-------------------------------------------- TAATAAA 
Human ô TATTITCTGAACTTGGGAACAATGAATACTTCAAGGGTATGGCTICIGCCTAATAAA 
Rabbit B AAAAATTAT--------------------------------------- GGGGACA 
Human B AATTTCTATTAAAGGTTAATTTGTTCCCTAAGTCCAACTACTAAACTGGGGGATA 


Figure 8.2. Examples of deletions flanked by short direct repeats within non-coding
sequences of mammalian β-globin genes (redrawn from Efstratiadis et al., 1980). Pairwise
alignments of sequences within non-coding regions of mammalian β-globin genes are
shown. A deletion is assumed in the upper sequence with respect to the lower sequence.
Dashes denote the nucleotides not present in the upper sequence. Short direct repeats are
underlined. The two aligned human Aγ large intron sequences are those of two different
alleles.

appear that del904 was the original inactivating mutation (Galili and Swanson,
1991). C822 is immediately flanked by imperfect direct repeats (TACAGGCCT 
and TACAAGGCAG, where C is nucleotide 822) that could have templated the 1 
bp deletion via slipped mispairing. Slipped mispairing may also have been respon- 
sible for the G904 deletion since G904 is the 3’ most base of a string of five Gs. 

A direct repeat may also have templated the single base deletion in the 5’ flanking
region of the human interferon α10 (IFNA10; 9p22) gene relative to the other
α-interferon genes (see Figure 4.27). Flanking direct AGGT repeats appear to have
mediated an AGG deletion exhibited by both orthologous and paralogous members
of the primate T-cell receptor β (TCRB; 7q35) gene family whilst in the same
gene family, overlapping 7 bp direct repeats (CTTTTCTTTTCT) may have
served to template a TTTCT deletion (Funkhouser et al., 1997).
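
The relationship between a deletion and the direct repeats that flank it can be made concrete with a short sketch. It takes the 12 bp sequence quoted above and measures how far the deleted segment matches the sequence on either side of it. The function names are hypothetical, and the placement of the TTTCT deletion at positions 2-6 (0-based) is an assumption made for illustration; the text gives only the repeat and the deleted bases.

```python
def flanking_homology(seq, start, end):
    """Count bases of direct-repeat homology around the deletion seq[start:end].

    Returns (left, right): `right` is how far the deleted segment matches the
    sequence that follows it; `left` is how far the bases preceding the
    deletion match the last bases of the deleted segment.
    """
    right = 0
    while end + right < len(seq) and seq[start + right] == seq[end + right]:
        right += 1
    left = 0
    while start - left - 1 >= 0 and seq[start - left - 1] == seq[end - left - 1]:
        left += 1
    return left, right


def delete(seq, start, end):
    """Return seq with the half-open interval [start, end) removed."""
    return seq[:start] + seq[end:]


ancestral = "CTTTTCTTTTCT"            # repeat quoted in the text
left, right = flanking_homology(ancestral, 2, 7)   # assumed position of TTTCT
product = delete(ancestral, 2, 7)
print(left, right, product)
```

Under these assumptions the homology extends 2 bp to the left and 5 bp to the right of the deletion, 7 bp in total, which agrees with the overlapping 7 bp direct repeats (CTTTTCT) described above. Because the breakpoint can be placed anywhere within such a repeat without changing the product, the sum of the two values is the more informative figure.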


8.2.3 Microdeletions mediated by inverted repeats 


Inverted repeats may also have mediated the generation of microdeletions during 
gene evolution. One example is the inactivating 13 bp deletion in exon 2 of the 
gibbon urate oxidase gene (Wu et al., 1992; see Chapter 6, section 6.2.1); two 
imperfect inverted repeats (CAAGAAC and GTTCATG) span the breakpoints of 
this deletion. A 20 bp deletion has been reported from the 5’ region of the δ-globin
(Hbd) gene of the colobus monkey, Colobus polykomos (Vincent and Wilson,
1989). This deletion, which spans the transcriptional initiation site used in Old 
World monkeys and anthropoid apes, is responsible for a five-fold reduction in 
Hbd gene transcription as assessed by in vitro transcription assay. Inspection of the 
putative deleted bases and the flanking DNA sequence reveals the presence of a 13 
bp imperfect inverted repeat which could have been responsible for the deletion 
through formation of a hairpin loop (Figure 8.3). 

An inverted repeat also appears to have templated the deletion of a GAT codon
in the human interferon α2 (IFNA2; 9p22) gene relative to the other α-interferon
genes (Figure 4.27). Finally, a contiguous inverted repeat sequence
(ATTCCCAGTTTCTGGGAAT) may well have templated an 8 bp deletion exhibited by
both orthologous and paralogous members of the primate T-cell receptor β
(TCRB; 7q35) gene family (Funkhouser et al., 1997).
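
Whether two sequence arms form an inverted repeat, and how imperfect that repeat is, can be checked by comparing one arm with the reverse complement of the other. The sketch below applies this to the two examples quoted in this section. The function names are hypothetical, and the division of the 19 bp TCRB sequence into two 8 bp arms separated by TTT is an interpretation of the quoted sequence rather than something stated in the source.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an upper-case DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def arm_mismatches(left_arm, right_arm):
    """Count mismatches between right_arm and the reverse complement of left_arm.

    Zero means a perfect inverted repeat capable of forming a fully paired
    hairpin stem; small non-zero values indicate an imperfect repeat.
    """
    if len(left_arm) != len(right_arm):
        raise ValueError("arms must be of equal length")
    return sum(a != b for a, b in zip(reverse_complement(left_arm), right_arm))


# Gibbon urate oxidase arms quoted in the text.
uox = arm_mismatches("CAAGAAC", "GTTCATG")

# TCRB sequence ATTCCCAG-TTT-CTGGGAAT, split into arms as assumed above.
tcrb = arm_mismatches("ATTCCCAG", "CTGGGAAT")
print(uox, tcrb)
```

On this reading the urate oxidase arms differ at a single position, in keeping with their description as imperfect inverted repeats, whereas the two TCRB arms are exact reverse complements of one another.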


+1 
C aagggagggcagag CTTCTGA 
R g a a ccaactgttgcttATACTTG 
B g g a tcaactgttgcttACATTTG 
S ČI O g ctaactgttgcttTGACTTG 
H c g a tcgactgttgcttACACTTT 


Figure 8.3. Alignment of δ-globin (Hbd) gene sequences from colobus monkey (C),
rhesus macaque (R), baboon (B), spider monkey (S) and human (H) showing the location
of a 20 bp deletion in the colobus monkey (after Vincent and Wilson, 1989). Transcribed
sequences are denoted by upper case letters, flanking regions are in lower case letters. +1
denotes the site of transcriptional initiation. Underlined bases represent an imperfect
inverted repeat which may contribute to the formation of a hairpin loop.


8.2.4 Microdeletions in vertebrate evolution


The comparison of gene/protein sequences between humans and the great apes 
also yields examples of in-frame microdeletions that must have occurred during 
primate evolution. Thus, amino acid residue Glu9 of the blue cone pigment pro- 
tein (BCP; 7q31-q35) present in the talapoin monkey Miopithecus talapoin (an Old 
World primate) and in the marmoset Callithrix jacchus (a New World primate) is 
absent from the human protein and appears to have been deleted from the BCP 
gene within the human lineage (Hunt et al., 1995). The functional consequences 
of the removal of this amino acid residue are however unclear. 

Some gene regions harbor a disproportionate number of deletions/insertions 
inferred from alignment gaps noted in sequence comparisons e.g. exons 6 of the 
orthologous amelogenin (AMELX, Xp22.1-p22.31; AMELY, Yp11.2) genes of 
various vertebrates. These lesions have dramatically reduced the similarity 
between vertebrate amelogenins in the Pro/Gln-rich region of the protein as com- 
pared with that manifested by other regions (Toyosawa et al., 1998). 

Some microdeletions occurring during evolution may have been advantageous 
by virtue of their alteration of a protein product, others through a change in the 
reading frame bringing about gene inactivation (see Chapter 6, section 6.2). Of 
course, microdeletions need not necessarily have conferred any selective advantage;
even if merely neutral with respect to fitness, they could have become fixed
by genetic drift alone. 


8.3 Microinsertions in evolution 


Microinsertions that have occurred during evolution have scarcely been studied. 
However, the underlying generative mechanisms are likely to be broadly similar 
to those causing human genetic disease. In their study of microinsertions in 
human genes causing inherited disease, Cooper and Krawczak (1991) concluded 
that insertional mutation involving the introduction of <10 bp DNA sequence 
into a gene coding region was not a random process and appeared to be highly 
dependent upon the local DNA sequence context. Further, mechanistic models 
which have explanatory value in the context of gene deletions were found to be 
useful in accounting for the nature and location of gene insertions. 

In noncoding DNA, insertions are about half as frequent as deletions and 
mostly involve single nucleotides (De Jong and Ryden, 1981; Graur et al., 1989; 
Saitou and Ueda, 1994). The rate of gap formation, regardless of whether caused 
by insertions or deletions, has been estimated to be ~0.15-0.17 kb⁻¹ Myr⁻¹ (Saitou
and Ueda, 1994). In practice, studies of the DNA sequence environment of 
microinsertions that have occurred during evolution are likely to be rather diffi- 
cult since the original sequence context of the insertion or deletion will often have 
become obscured by subsequent mutation. 
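
To give a feel for the magnitude of the gap formation rate quoted above, the arithmetic can be sketched as follows. This is a back-of-envelope illustration only: it assumes that the rate is constant, that it applies along a single lineage, and that gaps accumulate independently. The function name, region length and time span are hypothetical.

```python
def expected_gaps(length_bp, elapsed_myr, rate_per_kb_per_myr):
    """Expected number of gap events in a region over a given time span.

    Assumes a constant rate expressed per kilobase per million years.
    """
    return (length_bp / 1000.0) * elapsed_myr * rate_per_kb_per_myr


# Hypothetical case: a 2 kb noncoding region followed for 10 Myr.
low = expected_gaps(2000, 10, 0.15)
high = expected_gaps(2000, 10, 0.17)
print(round(low, 2), round(high, 2))
```

For the hypothetical 2 kb region, the quoted range corresponds to roughly three gap events over 10 Myr, which illustrates why the original sequence context of any one event is so readily obscured by later changes.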


8.3.1 Gene coding region microinsertions 


Microinsertions occurring during human gene evolution can be found by 
sequence comparison of either orthologous or paralogous genes/proteins. Thus,
the 54 bp insertion in the promoter of the human liver arginase (ARG1; 6q22.3-q23.1)
gene is absent in the orthologous gene of Macaca fascicularis (Goodman et
al., 1994). Similarly, a 37 bp insertion has been introduced into the promoter of
the orthologous Duchenne muscular dystrophy (DMD; Xp21) gene of the spider
monkey, Ateles geoffroyi (Fracasso and Patarnello, 1998); the inserted sequence is
flanked by two TAAA repeats. A 12 bp insertion in the T-cell receptor α-chain
(TCRA; 14q11.2) gene encodes an Ile-Pro-Ala-Asp tetrapeptide (residues 88-91)
that is specific to the primate lineage (Thiel et al., 1995). Interestingly, the region
of the T-cell receptor α-chain protein between positions 86 and 91 appears to be a
hotspot for insertional events during evolution: the rabbit and rat genes appear to
have acquired a 3 bp (single amino acid) insertion, whereas the bovine, ovine, and
murine genes manifest a 6 bp (double amino acid) insertion at this position (Thiel
et al., 1995). Finally, a 24 bp sequence found in the transmembrane domain region
of the human glycophorin E (GYPE; 4q28-q31) gene appears to have been
derived from the paralogous glycophorin B (GYPB; 4q28-q31) gene during primate
evolution prior to the divergence of the gorilla from the lineage of the other
great apes (Rearden et al., 1993). It is unclear whether this insertion event was
mediated by homologous unequal recombination or gene conversion.


8.3.2 Microinsertion polymorphisms 


Several microinsertion polymorphisms have been reported in human genes. 
Thus, a single nucleotide insertion polymorphism has been noted in the promoter 
region of the insulin promoter factor 1 (IPF1; 13q12.1; Yamada et al., 1998) gene.
A single nucleotide insertion polymorphism is also present in the ABO blood 
group (ABO; 9q34; Olsson and Chester, 1996) gene which serves to inactivate it. 
Finally, a 9 bp insertion polymorphism in exon 9 of the cytochrome P450 
CYP2D6 (22q13.1) gene occurs in the Japanese population and is associated with 
a poor metabolizer phenotype (Yokoi et al., 1996). 


8.3.3 Indels 


Clearly, in extant proteins, selection must have ensured the retention of essential 
features of tertiary structure. Indeed, insertions or deletions which altered the 
reading frame must have been rendered harmless in order for the protein to retain 
its biological activity. Thus, the insertion of bases inferred in one member of a 
paralogous protein pair often implies a counterbalancing deletion in the immedi- 
ate vicinity (or vice versa) to restore the reading frame. Such ‘indels’ tend to 
involve sequences of between 1 bp and 5 bp in length (Pascarella and Argos, 1992).
They are generally found in turn and coil structures and rarely interrupt α-helices
and strands (Pascarella and Argos, 1992). One example of a simple indel occurring
during evolution is the deletion of an AA doublet and its replacement with a GT
doublet in the 5’ flanking region of the human interferon α9 gene (Figure 4.26).
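
The way in which a counterbalancing insertion restores the reading frame after a nearby deletion can be illustrated with a toy coding sequence. The sketch below compares codons before and after each change. The sequence, the positions of the two events and the function names are all invented for illustration and are not drawn from any gene discussed here.

```python
def codons(seq):
    """Split a sequence into complete codons, ignoring any trailing bases."""
    usable = len(seq) - len(seq) % 3
    return [seq[i:i + 3] for i in range(0, usable, 3)]


def delete_base(seq, pos):
    """Return seq with the single base at 0-based position pos removed."""
    return seq[:pos] + seq[pos + 1:]


def insert_base(seq, pos, base):
    """Return seq with base inserted before 0-based position pos."""
    return seq[:pos] + base + seq[pos:]


def changed_codons(original, mutant):
    """Indices of codons that differ between the two sequences."""
    pairs = zip(codons(original), codons(mutant))
    return [i for i, (a, b) in enumerate(pairs) if a != b]


# Hypothetical 8-codon coding sequence.
original = "ATGGCTGAAACCGGTTTACGTAAA"
frameshifted = delete_base(original, 4)          # 1 bp deletion in codon 2
restored = insert_base(frameshifted, 11, "A")    # compensating 1 bp insertion

print(changed_codons(original, frameshifted))
print(changed_codons(original, restored))
```

In this toy case the deletion alone alters every codon downstream of the lesion, whereas the compensated sequence differs from the original only in the three codons lying between the two events. This is the sense in which a paired insertion and deletion can be tolerated where either change alone could not.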

An example of a more complex indel that has occurred during vertebrate evolution
is provided by the human chorionic gonadotropin β-subunit (CGB;
19q13.3) gene and is responsible for the introduction of a 24 amino acid C-terminal
extension to the protein product (Talmadge et al., 1984). The CGB gene emerged
as a result of the duplication of an ancestral β-luteinizing hormone-like (LHB;
19q13.3) gene. A single base deletion (A1540) altered the translational reading
frame of the ancestral gene allowing read-through into what was originally 3’
untranslated region (UTR). An insertion of a CG dinucleotide also occurred at
nucleotide 1612 within the ancestral 3’ UTR and served to extend the
chorionic gonadotropin β-subunit protein by a further 8 amino acids. The CGB
gene therefore evolved from its LHB-like ancestor by acquiring 8 amino acids
through translation in a new reading frame and incorporating 24 novel amino
acids from the 3’ UTR into its coding sequence.


8.4 Insertion of transposable elements in evolution


Table 8.1. Retroelements in the human genome

Retroelement type              Retroelement family                   Copy number    % of genome

C-type-related HERVs           HERV-ERI superfamily                                 0.07%
                               HERV-E (4-1, ERVA, MP-2)              35-50
                               HERV-E LTR                            500-600
                               51-1                                  35-50
                               ERV1                                  10-15
                               HERV-R (ERV3)                         10
                               RRHERV-I                              20
                               S71                                   15-20
                               S71 LTR                               50-100
                               ERV-FRD                               5-7
                               ERV9                                  30-40          0.2%
                               ERV9 LTR                              3000-4000
                               HERV-P (HuERS-P1, HuERS-P2,           50-90          0.01%
                                 HuERS-P3/HuERS-P)
                               HERV-I (RTVL-I)                       25-50          0.01%
                               ERV-FTD                               5-7
C-type and HTLV-related HERVs  HERV-H (RTVL-H, RGH)                  900-1000       0.2%
                               HERV-H LTR                            1000
                               HRES-1                                2
A-, B- and D-type-related      HML families 1-6                      50             0.5%
  HERVs                        HERV-K (HM, HLM, HML-1)
                               HERV-K LTR                            10000-25000
                               ERV-MLN (HML-4)                       20-25
THE-1 elements                 THE-1                                 10000          1%
                               THE-1 LTR                             30000
Nonviral retroposons           LINE-1                                100000         5%
                               Alu                                   500000         5%
                               SINE-R                                5000           0.1%


Transposable elements in the human genome are essentially of two kinds, those
that undergo transposition through a DNA intermediate, and those that undergo
transposition through an RNA intermediate. Transposable elements with a DNA 
intermediate are termed transposons and these are characterized by terminal 
inverted repeats and duplication of the target site (visible as direct repeats flank- 
ing the element). However, the great majority of transposable elements in the 
human genome have undergone retrotransposition through an RNA intermedi- 
ate. Such retroelements or retroposons may be of either viral or nonviral origin (Table 
8.1). Whilst endogenous retroviral sequences comprise some 0.1-0.6% of the 
human genome (Leib-Mösch et al., 1990), the nonviral Alu sequences and LINE
elements may together make up as much as 10% of the genome. 


8.4.1 Endogenous retroviral sequences and transposable elements 


Retroposons. A number of retroposon families have been characterized in primate
genomes (Leib-Mösch and Seifarth, 1996; McDonald, 1993; Table 8.1).
Occasionally, these are human-specific (e.g. the HERV-K10-related SINE-R.C2;
Zhu et al., 1992; 1994) but usually they are found distributed through the
genomes of other primate species (e.g. RTVL-H (Goodchild et al., 1993), RTVL-I
(Maeda and Kim, 1990), HERV-K (Steinhuber et al., 1995), and HERV-L
(Cordonnier et al., 1995)). Type I and II HERV-H elements were amplified to
~1000 copies after the divergence of New World from Old World monkeys but
before the divergence of apes from Old World monkeys (Leib-Mösch and
Seifarth, 1996). By contrast, the family of type Ia HERV-H elements expanded to
~100 copies only after the divergence of apes from Old World monkeys. Analysis 
of the copy number, distribution and sequence characteristics of such endogenous 
elements promises to provide important clues as to the evolutionary history and 
phylogeny of the various mammalian orders, suborders, and species, and even the 
population genetics of human racial groups (Furano and Usdin, 1995). 

Retroposons have sometimes become integrated into the vicinity of human 
genes. Thus, two copies of an RTVL-I sequence are present in the human haptoglobin
(HP; 16q22.1) gene cluster whilst an additional copy has been inserted in the
same region in the orthologous chimpanzee gene (Maeda and Kim, 1990). Rather
more dramatically, the endogenous retrovirus, HRES-1 lies within the coding
sequence of the human transaldolase gene (TALDO1; 11p15; Banki et al., 1994).

One view of endogenous retroviral elements is that they inserted themselves 
into the germline of our primate ancestors during the last 40 Myrs as a result of 
infection with exogenous retroviruses, persisting thereafter as proviruses, albeit 
rendered replication-defective by multiple mutational events (Shih et al., 1991). 
Another (not incompatible) view is that retroviruses themselves originally arose 
from intracellular retroelements (the protovirus hypothesis), a view which is sup- 
ported by the phylogenetic analysis of endogenous retroviral DNA sequences 
(Figure 8.4). Regardless of whether or not the horizontal transmission of retrovi- 
ral elements has taken place, copy number amplification has certainly occurred. 

Retrotransposition often generates retroposon sequence variants owing to its 
inherent imprecision: target site rearrangements combine with the infidelity of 
both reverse transcriptases and RNA-dependent RNA polymerases to ensure that 
the inserted sequences are highly variable thus providing new avenues for the 
highly opportunistic evolutionary process (Preston, 1996). 

Figure 8.4. Phylogenetic analysis of mammalian endogenous retroviral pol sequences 
(after Leib-Mosch and Seifarth, 1996). HERV: Human endogenous retroviral elements; 
ERV: Endogenous retrovirus; BaEV: Baboon endogenous virus; GaLV: Gibbon ape 
leukemia virus; AKV: Endogenous murine leukemia virus; MoMuLV: Moloney murine 
leukemia virus; MMTV: Mouse mammary tumor virus; IAP-M: Murine intracisternal 
A-type particles. 


Transposons. Transposon-like THE-1 repeats, which lack any obvious homology 
to retroviral sequences, have been found in the deletion-prone intron 43 of the 
dystrophin (DMD; Xp21.2-p21.3) gene (Pizzuti et al., 1992), the human blood 
group GC (4q12) gene (Witke et al., 1993) and the 3’ untranslated region of the 
human calmodulin-related protein (CALML1; 7p13-pter) gene (Deka et al., 
1988b). A cluster of three THE-1 repeats located in a 26 kb region of intron 7 in 
the human DMD gene has arisen by three independent insertion events 
(McNaughton et al., 1993, 1995). There is some evidence to support the hypothe- 
sis that the insertion of these elements has occurred at preferred target sites (Deka 
et al., 1988a). 

A pseudoautosomal gene sequence (Tramp) has recently been isolated which 
encodes within its single exon a protein with homology to transposases (enzymes 
that mediate transposition) of the Ac family (Esposito et al., 1999). It is as yet 
unclear whether the Tramp protein has been involved in the transposition of 
other transposable elements or if it has instead become specialized for a novel cel- 
lular function. The centromeric protein CENP-B (CENPB; 20p13) may also rep- 
resent an example of a transposase-encoded protein which has acquired a cellular 
function. This protein binds to the CENP-B box (TTCGNNNNANNCGGG) 
sequence in the alpha satellite DNA of human centromeres and has sequence sim- 
ilarity to the pogo family of transposases which includes the Tigger elements 
(Kipling and Warburton, 1997). Since CENP-B has nicking activity, it may pro- 
mote homologous recombination and could have contributed to the species-spe- 
cific patterns of evolution of satellite repeats. 
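
The CENP-B box consensus quoted above can be searched for with a simple pattern match. The following is a minimal sketch in Python, assuming that N stands for any of the four nucleotides; the function names and the test sequence are hypothetical illustrations and are not taken from Kipling and Warburton (1997).

```python
import re
from typing import List

# CENP-B box consensus as quoted in the text; N is read as "any nucleotide".
CENP_B_BOX = "TTCGNNNNANNCGGG"


def consensus_to_pattern(consensus: str) -> str:
    """Translate a consensus with N wildcards into a regular expression."""
    return "".join("[ACGT]" if base == "N" else base for base in consensus)


def find_boxes(sequence: str, consensus: str = CENP_B_BOX) -> List[int]:
    """Return 0-based start positions of every match, overlapping ones included."""
    # A lookahead consumes no characters, so overlapping matches are reported.
    lookahead = re.compile("(?=" + consensus_to_pattern(consensus) + ")")
    return [m.start() for m in lookahead.finditer(sequence.upper())]


# Hypothetical test sequence (not real alpha satellite DNA).
example = "AAATTCGTTGGAAACGGGATT"
print(find_boxes(example))
```

Only the forward strand is scanned here; a fuller search would also test the reverse complement.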

Since many transposable elements contain enhancer sequences, their transposition 
may have served to alter the pattern of host gene expression at or around 
the integration site. Thus, once transposed, evolution may have recruited such 
enhancers to play a role in the transcriptional regulation of a gene in the vicinity 
of the integration site (see Chapter 5, section 5.1.12). One example of this is the 
human salivary amylase (AMY1C; 1p21) gene where the HERV-E-derived 
enhancer may be involved in tissue-specific expression (Ting et al., 1992; Chapter 
5, section 5.1.12, Endogenous retroviral elements). 

Evolution may also recruit transposable elements as a means to alter mRNA 
processing. For example, a B2 (SINE) element has become inserted into the 3’ 
untranslated region of the murine Lifr gene encoding the soluble form of the 
leukemia inhibitory factor receptor (LIFR; Michel et al., 1997). Insertion of the 
B2 element has, by potentiating alternative 3’ mRNA processing and alternative 
splicing, given rise to a truncated mRNA species (relative to the mRNA encoding 
the membrane-anchored LIFR) which encodes soluble LIFR. In the rat, no such 
retrotranspositional event has occurred and the soluble form of LIFR is not 
found. 

A very special case of the opportunistic recruitment of a transposable element 
may have been that of the recombination-activating gene (RAG) transposase pos- 
tulated to have been inserted into an ancestral immunoglobulin/T-cell receptor 
gene soon after the divergence of jawed and jawless fishes (Chapter 9, section 
9.4.2). The subsequent conversion of this transposon into a site-specific recombi- 
nase may have been the critical event in allowing the vertebrates to generate the 
genetic diversity so essential for the flexible adaptive response of their immune 
systems. 


8.4.2 LINE elements 


LINE elements have assumed considerable importance in the context both of 
gene pathology and gene evolution (Kazazian and Moran, 1998). They are present 
in a wide range of mammals (Furano and Usdin, 1995) and are represented by 
some 40 subfamilies (Smit et al., 1995). They are nonrandomly distributed in the 
human genome, inserting preferentially into chromosomal G bands (Wichman et 
al., 1992) and at the DNA level, into A-rich sequences (Vanin, 1984). The total 
numbers of LINE elements in four of the great apes have been estimated by Hwu 
et al. (1986): human, 107 000; chimpanzee, 51 000; gorilla, 64 000; and orangutan, 
84 000. Since these figures differ markedly, it follows that numerous insertions 
and deletions of these sequences must have occurred during the evolution of the 
great apes. 

In a pathological context, a number of examples of gene inactivation through 
insertion of LINE elements into gene coding sequences are known (Miki et al., 
1992; Narita et al., 1992; reviewed by Cooper and Krawczak, 1993) and in some 
cases a preference for integration at AT-rich sequences is exhibited (Kariya et al., 
1987; Kazazian et al., 1988). Further, the target sites of two LINE elements 
inserted into the factor VIII (F8C; Xq28) gene causing hemophilia A (Kazazian et 
al., 1988) are 80% homologous to a 10 nucleotide motif (GAAGACATAC) present 
in one of the highly favored retroviral insertion target sequences reported by Shih 
et al. (1988). 
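
The comparison of a target site with the 10 nucleotide motif can be expressed as a position-by-position count of matches. The sketch below reads '80% homologous' as 8 matching positions out of 10, which is an assumption; the candidate site shown is hypothetical and is not one of the F8C insertion sites.

```python
# 10-nucleotide motif quoted in the text.
MOTIF = "GAAGACATAC"


def percent_identity(site: str, motif: str = MOTIF) -> float:
    """Percentage of positions at which an ungapped site matches the motif."""
    if len(site) != len(motif):
        raise ValueError("site and motif must have the same length")
    matches = sum(a == b for a, b in zip(site.upper(), motif.upper()))
    return 100.0 * matches / len(motif)


# Hypothetical site differing from the motif at two positions.
hypothetical_site = "GAAGTCTTAC"
print(percent_identity(hypothetical_site))
```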

Preferential target sites for LINE elements are also apparent in mammalian 
genes during evolution. For example, the interleukin-6 genes of rodents represent 
hotspots for LINE element retrotransposition (Qin et al., 1991). During mam- 
malian evolution, the introduction of LINE elements in the vicinity of genes has 
sometimes altered gene expression as a consequence of their being recruited to 
perform a regulatory function (examples of this phenomenon are given in Chapter 
5, section 5.1.12, Alu sequences). LINE elements have also served to promote 
genetic rearrangements and indeed they may well have mediated both gene inver- 
sion (Chapter 9, section 9.1) and duplication (Section 8.5) events during evolu- 
tion. Finally, and perhaps most importantly, LINE elements may have repeatedly 
transduced exons from one genomic location to another, thereby potentiating 
exon shuffling (Chapter 3, section 3.6.1), the transfer of exons encoding specific 
protein modules between genes. 


8.4.3 Alu sequences 


Evolution of Alu sequences. The fossil Alu monomer is thought to have arisen 
by the deletion of the central S domain of 7SL RNA (RN7SL) followed by the 
addition of a 3’ poly(A) tract which may have facilitated reverse transcription of 
these RNA polymerase III transcripts (Mighell et al., 1997; Figure 8.5). The free 
left arm monomer then arose by deletion of 42 bp from the fossil Alu monomer 
whilst the free right arm monomer arose by deletion of 11 bp from the fossil Alu 
monomer (Mighell et al., 1997; Figure 8.5). The first Alu sequence may have been 
formed by dimerization of a free left arm monomer with a free right arm monomer 
(Figure 8.5), an event which is thought to have occurred about 60 Myrs ago, before 
the divergence of prosimians (Zietkiewicz et al., 1998). Subsequently, many 
rounds of sequential amplification took place to generate the 12 human Alu 
subfamilies seen today (Mighell et al., 1997; Figure 8.6). 

Figure 8.5. Proposed model of dimeric Alu formation via intermediate monomeric units 
derived from 7SL RNA which is neither capped nor polyadenylated. Important 
nucleotide positions are marked on the schematic 7SL RNA moiety. See text for 
explanation of the progression from 7SL RNA to the first dimeric Alu repeat. The 
approximate positions and consensus sequences of the RNA pol III promoter boxes A and 
B in the Alu sequence are marked. Redrawn from Mighell et al. (1997). 

Figure 8.6. The proposed evolution of the 12 human Alu subfamilies. Numbers in 
parentheses represent approximate times (in Myrs) of insertion of different subfamilies 
into the human genome (redrawn from Mighell et al., 1997). 

The total numbers of copies of Alu sequences in four of the great apes have 
been estimated by Hwu et al. (1986): human, 910 000; chimpanzee, 330 000; 
gorilla, 410 000; and orangutan, 580 000. As with the LINE elements, it would 
appear that numerous insertions and deletions of these sequences have occurred 
during the evolution of the great apes. At the chromosomal level, Alu sequences 
insert preferentially into R bands (Wichman et al., 1992) whereas at the DNA 
level, they preferentially integrate into A-rich sequences (Batzer et al., 1990; 
Daniels and Deininger, 1985; Matera et al., 1990). 
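
The copy number estimates of Hwu et al. (1986) quoted in this section and in Section 8.4.2 can be put on a common scale. The short sketch below simply tabulates each species relative to human and the number of Alu copies per LINE copy; the function names are illustrative, and no claim is made about the precision of the underlying estimates.

```python
# Copy number estimates from Hwu et al. (1986), as quoted in the text.
LINE_COPIES = {"human": 107_000, "chimpanzee": 51_000,
               "gorilla": 64_000, "orangutan": 84_000}
ALU_COPIES = {"human": 910_000, "chimpanzee": 330_000,
              "gorilla": 410_000, "orangutan": 580_000}


def relative_to_human(counts: dict) -> dict:
    """Express each species' copy number as a fraction of the human value."""
    return {sp: round(n / counts["human"], 2) for sp, n in counts.items()}


def alu_per_line(alu: dict, line: dict) -> dict:
    """Number of Alu copies per LINE copy in each species."""
    return {sp: round(alu[sp] / line[sp], 1) for sp in alu}


print(relative_to_human(LINE_COPIES))
print(relative_to_human(ALU_COPIES))
print(alu_per_line(ALU_COPIES, LINE_COPIES))
```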

During mammalian evolution, the introduction of Alu sequences in the vicin- 
ity of genes has sometimes altered gene expression as a consequence of their being 
recruited to perform a regulatory function; examples of this phenomenon are 
given in Chapter 5, section 5.1.12, Alu sequences. Alu sequences may also have been 
involved in, or mediated, many other different types of gene rearrangement dur- 
ing gene evolution including gross deletions (Section 8.1), duplications (Section 
8.5), transpositions (Chapter 9, section 9.2), gene fusions (Chapter 9, section 9.3), 
recombination (Chapter 9, section 9.4) and gene conversion events (Chapter 9, 
section 9.5). 


Alu sequence polymorphisms. Once inserted, specific Alu sequences have often 
been relatively stable in terms of their location during primate evolution (Sawada 
et al., 1985). This notwithstanding, some human Alu sequences are polymorphic 
in terms of their presence or absence (Batzer et al., 1994; Edwards and Gibbs, 
1992; Kass et al., 1994; Meagher et al., 1996; Milewicz et al., 1996; Tishkoff et al., 
1996; Zucman-Rossi et al., 1997), a situation which is sometimes also found in 
other primates (Bailey and Shen, 1993). Some of these polymorphisms may be of 
functional significance e.g. a common insertion/deletion polymorphism 
(0.40/0.60) within intron 16 of the human angiotensin I converting enzyme 
(DCP1; 17q23) gene, explicable in terms of the presence or absence of a 287 bp 
Alu repeat, is known to have an important influence on serum enzyme concentra- 
tion (Rigat et al., 1990). Alu sequence retrotransposition is also an occasional cause 
of genetic disease (e.g. Muratani et al., 1991; Vidaud et al., 1993; Wallace et al., 
1991). 
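
For a two-allele insertion/deletion polymorphism such as that described above, expected genotype frequencies can be derived from the allele frequencies. The sketch below assumes Hardy-Weinberg proportions and assumes that 0.40 refers to the insertion (Alu-present) allele; neither assumption is stated in the source, and the genotype labels II, ID and DD are used here purely for illustration.

```python
def hardy_weinberg(p_insertion: float) -> dict:
    """Expected genotype frequencies under Hardy-Weinberg proportions.

    p_insertion is the frequency of the insertion (I) allele; the deletion
    (D) allele is taken to have frequency 1 - p_insertion.
    """
    if not 0.0 <= p_insertion <= 1.0:
        raise ValueError("allele frequency must lie between 0 and 1")
    q = 1.0 - p_insertion
    return {
        "II": p_insertion ** 2,        # homozygous for the Alu insertion
        "ID": 2.0 * p_insertion * q,   # heterozygous
        "DD": q ** 2,                  # homozygous for absence of the Alu
    }


# Allele frequencies 0.40/0.60 as quoted in the text (assignment assumed).
genotypes = hardy_weinberg(0.40)
print({g: round(f, 2) for g, f in genotypes.items()})
```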


Alu sequence target sites. Although Alu sequences occur on average every 3-6 
kb, there are several examples of regions that appear to be preferential target sites 
for Alu sequence insertion in mammalian genes. Thus, a 40 kb region, spanning 
the spermatid-specific protamine genes PRM1, PRM2 and the transition protein 
(TNP2) gene (16p13.2), contains a total of 42 Alu sequences (Nelson and Krawetz, 
1994). Similarly, a 22 kb region telomeric to the HLA-B-associated transcript 2 
(BAT2; D6S51E; 6p21.3) gene in the HLA class III locus contains 42 Alu repeats 
(Iris et al., 1993), whilst a 2.2 kb segment 5’ to the human lysozyme (LYZ; 12) gene 
contains four such repeats (Riccio and Rossolini, 1993). 
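
The extent to which these regions are enriched for Alu sequences can be gauged by comparing their spacing with the genome-wide average of one repeat every 3-6 kb. The sketch below uses only the region sizes and repeat counts quoted above; the dictionary keys and function names are illustrative.

```python
# (region length in kb, number of Alu repeats), as quoted in the text.
REGIONS = {
    "PRM1/PRM2/TNP2 region": (40.0, 42),
    "telomeric to BAT2": (22.0, 42),
    "5' to LYZ": (2.2, 4),
}

# Genome-wide average spacing quoted in the text: one Alu every 3-6 kb.
BACKGROUND_KB_PER_ALU = (3.0, 6.0)


def kb_per_alu(length_kb: float, n_alu: int) -> float:
    """Average spacing between Alu repeats in a region, in kb."""
    return length_kb / n_alu


def fold_enrichment(length_kb: float, n_alu: int, background: float) -> float:
    """How many times denser the region is than a given background spacing."""
    return background / kb_per_alu(length_kb, n_alu)


for name, (length_kb, n_alu) in REGIONS.items():
    low = fold_enrichment(length_kb, n_alu, BACKGROUND_KB_PER_ALU[0])
    high = fold_enrichment(length_kb, n_alu, BACKGROUND_KB_PER_ALU[1])
    print(f"{name}: one Alu per {kb_per_alu(length_kb, n_alu):.2f} kb, "
          f"{low:.1f}- to {high:.1f}-fold denser than average")
```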


Alu sequences within protein-coding sequences. Many Alu sequences are 
found within introns and therefore this repeat is represented in heterogeneous 
nuclear RNA. Alu sequences are however also found at different locations within 
mRNA-homologous sequences, the majority occurring within the untranslated 
regions (UTRs). Thus, Yulug et al. (1995) reported that 5% of full-length human 
cDNAs contained an Alu sequence, with 82% and 14% of these being located in 
the 3’ UTR and 5’ UTR, respectively. In a few cases, however, Alu sequences have 
been incorporated into the coding sequences of human genes and have therefore 
altered the amino acid sequences of the encoded proteins. Thus, a 279 bp Alu 
sequence spans 103 bp of the coding region of a zinc finger protein (ZNF91; 
19p12) gene and extends 166 bp into the 3’ UTR (Yulug et al., 1995). Similarly, 110 
bp of Alu sequence lies within the coding region of the lectin-like type II integral 
membrane protein (KLRC1; 12) gene with 43 bp extending into the 3’ UTR 
(Yulug et al., 1995). A 279 bp Alu sequence is entirely contained within the coding 
region of the protein serine/threonine kinase stk2 (STK2; 3p21.1) gene 
(Yulug et al., 1995). Finally, and perhaps most dramatically, two Alu 
sequences (both with poly(A) tails) are entirely contained within the coding 
region of the regulator of mitotic spindle assembly 1 (RMSA1; 17p11.2-p12) gene 
accounting for 111 amino acids of its coding potential, some 40% of the total 
(Margalit et al., 1994). 

Figure 8.7. Ornithine δ-aminotransferase deficiency caused by a mutation in a resident 
intronic Alu element within the OAT gene. A point mutation within the right subunit of 
the inversely oriented intronic Alu repeat activates a donor splice site (from Labuda et al., 
1995). 


Splice-mediated insertion of Alu sequences. Alu sequences have also been 
found to alter protein coding sequences through the splice-mediated insertion of the 
repeat and this probably represents the major mechanism by which Alu sequences 
have entered protein coding regions. Figure 8.7 illustrates the principle involved 
by reference to the pathological example of ornithine δ-aminotransferase deficiency 
caused by a single base-pair substitution in an intronic Alu element in the 
human ornithine δ-aminotransferase (OAT; 10q26) gene; this lesion activates a 
cryptic donor splice site which results in the incorporation of the Alu sequence 
into the mRNA. 

In an evolutionary context, there are several examples of the splice-mediated 
insertion of Alu repeats into human gene coding sequences. The splice-mediated 
insertion of a 95 bp Alu sequence has been reported in the lecithin: cholesterol 
acyltransferase (LCAT; 16q22.1) gene (Miller and Zeller, 1997). In humans, the 
alternate Alu-containing transcript represents between 5% and 20% of the LCAT 
mRNA. It is also present in LCAT mRNA from chimpanzee, gorilla and orang- 
utan; in the latter, the Alu-containing mRNA species constitutes 50% of the total 
LCAT mRNA pool (Miller and Zeller, 1997). It is not however present in the 
LCAT genes of gibbons, or Old World and New World monkeys (Miller and 
Zeller, 1997). 

In the human biliary glycoprotein (BGP; 19q13.2) gene, three mRNA variants 
are produced as a result of the alternative splicing of an exon (IIa) with one of two 
virtually identical Alu cassettes derived from two intronic repeats (Figure 8.8). 
Other such examples are to be found in the human REL (2p12-p13) proto-onco- 
gene, and the complement decay-accelerating factor (DAF; 1q32) and comple- 
ment C5 (C5; 9q33) genes (Makalowski et al., 1994). Of the 17 Alu sequences 
found in mRNA coding regions by Makalowski et al. (1994), seven contained in-frame 
stop codons and three others were predicted to cause frameshifts. Thus it is 
perhaps not surprising that in several cases of mRNAs containing Alu sequences, 
allelic exclusion is evident and the mRNA containing the Alu sequence is of low 
abundance compared to splice variants of the same gene that lack the repeat 
(Mighell et al., 1997). This notwithstanding, it may well be that the splice-mediated 
insertion of Alu repeats has been an important evolutionary mechanism 
for creating diversity at the protein level. 

Figure 8.8. A scheme for alternative splicing in the human biliary glycoprotein (BGP) 
mRNA. Boxes represent exons and arrows the five intronic antisense Alu elements. 
TM, exon encoding the transmembrane domain. Dotted lines indicate the splicing 
patterns found in the three cDNAs (after Barnet et al., 1993). 
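
The two ways in which a splice-mediated insertion can disrupt a coding sequence, noted above for the Alu sequences surveyed by Makalowski et al. (1994), can be checked mechanically: an insertion whose length is not a multiple of three shifts the downstream reading frame, and any insertion may carry a stop codon in the frame in which it is read. The following is a minimal sketch; the function name, the frame_offset parameter and the test sequences are hypothetical and are not real Alu cassettes.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def classify_insertion(insert: str, frame_offset: int = 0) -> dict:
    """Report the reading-frame consequences of inserting a sequence.

    frame_offset (0, 1 or 2) is the number of leading bases of the insert
    that complete a codon begun in the upstream exon.
    """
    seq = insert.upper()
    # Positions (within the insert) of complete codons that are stop codons.
    stops = [i for i in range(frame_offset, len(seq) - 2, 3)
             if seq[i:i + 3] in STOP_CODONS]
    return {
        "length": len(seq),
        "shifts_frame": len(seq) % 3 != 0,
        "in_frame_stops": stops,
    }


# Hypothetical inserts: the first carries an in-frame TGA, the second is
# 10 bp long and therefore shifts the downstream reading frame.
print(classify_insertion("GCTGGATGACCC"))
print(classify_insertion("GCTGGACCCA"))
```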


Alu sequence incorporation by intron sliding. An alternative mechanism of 
Alu sequence incorporation into gene coding regions is intron sliding, illustrated 
by the example of the human HLA-DRB1 (6p21.3) gene; an intronic Alu 
sequence has been incorporated into exon 4 of the HLA-DR-β1 mRNA (Labuda 
et al., 1995; Figure 8.9). Among three variants of the HLA-DR-β1 cDNA, 
detected by library screening, one was considered to be the usual form, whereas 
two others were alternatively spliced owing to a lack of splicing at the intron 5 
donor site. As a result, exon 5 was extended into a nearby downstream Alu 
sequence in intron 5, either to include a stop codon within the Alu sequence, or to 
be spliced with exon 6 (the open reading frame in the extended exon 5 matches 
that of exon 6). These three cDNA clones may thus illustrate two phases of intron 
sliding: the inactivation of an existing splice site followed by the activation of a 
cryptic one. 

Figure 8.9. A resident intronic Alu sequence is incorporated into exon 4 of the HLA-DR-β1 
mRNA by intron sliding (from Labuda et al., 1995). Among three variants of the 
HLA-DR-β1 cDNA, detected by library screening, the upper one is considered to be the 
usual form, whilst the two others are alternatively spliced, apparently due to a lack of 
splicing at the intron 5 donor site. As a result, exon 5 is extended into a nearby 
downstream Alu sequence in intron 5 either to include a stop codon within the Alu 
sequence or to be spliced with exon 6 (the open reading frame in the extended exon 5 
matches that of exon 6). These three cDNA clones may illustrate two phases of ‘intron 
sliding’: inactivation of an existing splice site followed by activation of a cryptic one. 


8.5 Gross gene duplications in evolution 


Gene duplication (or partial duplication) events are a fairly uncommon cause of 
human genetic disease (reviewed by Hu and Worton, 1992; Mazzarella and 
Schlessinger, 1998). Two distinct mechanisms are currently envisaged: (i) homologous 
unequal recombination either between homologous chromosomes or sister chro- 
matids and (ii) nonhomologous recombination at sites with minimal homology. 
Topoisomerase cleavage sites have been reported to be associated with pathologi- 
cal gene duplications (Kornreich et al., 1990; Hu et al., 1991) and potential sites 
for topoisomerases I and/or II have been found to coincide with the breakpoints 
of duplications in the human factor VIII (F8C; Xq28; Casula et al., 1990) and dys- 
trophin (DMD; Xp21.2-p21.3; Hu et al., 1991) genes. These observations are 
potentially interesting since topoisomerase activity has been implicated in several 
cases of nonhomologous recombination (Bullock et al., 1985). 

One of the best studied gross duplications in human genome pathology is the 
1.5 Mb duplication of the short arm of chromosome 17 associated with Charcot- 
Marie-Tooth disease type 1A. This recurring duplication is thought to be medi- 
ated by homologous unequal recombination between two misaligned ~30 kb 
CMT1A-REP repeat sequences flanking the CMT1A region in direct tandem 
orientation (Reiter et al., 1996). In humans, these repeats are 98% identical. 
Chimpanzees have two copies of a CMT1A-REP-like sequence, whereas gorilla, 
orangutan and gibbon have only a single copy, consistent with a duplication of the 
CMT1A-REP sequence after gorilla diverged from the human lineage but before 
the divergence of chimpanzee and human (Kiyosawa and Chance, 1996). 
Orthologous sequence comparison has provided evidence that the distal repeat 
was the progenitor copy. 


8.5.1 Duplications and the emergence of paralogous genes 


During vertebrate evolution, novel genes have arisen by genome duplication 
(tetraploidization; Chapter 2, section 2.1.1), intra-chromosomal regional duplica- 
tion (Chapter 2, section 2.1.1), and localized individual gene duplication. All 
three mechanisms give rise to paralogous genes, genes that occur within the same 
species and which have a common ancestor. Paralogous genes therefore include 
the members of multigene families and superfamilies. Evidence for the common 
ancestry of paralogous genes may come from sequence homologies (e.g. as with 
the voltage-sensitive ion channel genes; Strong et al., 1993) and/or from similar 
exon-intron organization, for example the cholesterol ester transfer protein 
(CETP; 16q21) and the phospholipid transfer protein (PLTP; 20q12-q13) (Tu et 
al., 1995) genes, or the growth hormone receptor (GHR; 5p12-p14), prolactin 
receptor (PRLR; 5p13-p14) and interferon receptor α, β, and ω, 1 (IFNAR1; 
21q22.1) genes (Lutfalla et al., 1992). 


8.5.2 Intra-chromosomal regional duplication 


In the human genome, whole chromosomal segments have sometimes been dupli- 
cated (see Chapter 2, section 2.1) resulting in a series of paralogous genes retain- 
ing their syntenic arrangement (Endo et al., 1997; Mazzarella and Schlessinger, 
1997). Thus, a number of genes located at 6p21.3 have paralogous genes at 9q33-q34 
(Endo et al., 1997); the chromosome 6 loci include genes for type 11 collagen 
α2 subunit (COL11A2), NOTCH4, 70 kDa heat shock proteins (HSPA1A, 
HSPA1B, HSPA1L), valyl-tRNA synthetase 2 (VARS2), complement components 
(C2, C4A, C4B), pre-B cell leukemia transcription factor 2 (PBX2) and 
retinoid X receptor β (RXRB) whilst the chromosome 9 paralogues include 
COL5A1, NOTCH1, HSPA5, VARS1, C5, PBX3, and RXRA. Other extensive 
chromosomal duplications include genomic segments present at Xq28 and 
16p11.1 involving the paralogous creatine transporter genes SLC6A8 and 
SLC6A10 respectively (Eichler et al., 1996). 

Some intra-chromosomal duplications may only involve one or a small number of 
genes, for example the duplication of the iduronate-2-sulphatase (IDS) locus at Xq28 
(Timms et al., 1995; Bondeson et al., 1995). Another example is that of the inverted 
duplication at 5q13 which duplicated the spinal muscular atrophy (SMA) gene, the 
survival motor neuron (SMN) gene and the apoptosis inhibitory protein (NAIP) gene 
(Campbell et al., 1997). Duplicated paralogous genes may however be translocated to 
quite different locations on the same chromosome, for example the adrenergic receptor 
(ADRA1B and ADRB2) genes on 5q23-q32 are quite distant from the evolutionarily 
related serotonin receptor (HTR1A) gene on 5cen-q11 (Oakey et al., 1991). 


8.5.3 Tandem duplications 


Multigene families often form syntenic gene clusters as a result of the tandem 
duplication of an ancestral gene sequence. For example, the immunoglobulin 
genes are clustered at 14q32.33 (IGHA, IGHD, IGHG) and 22q11.2 (IGLC, 
IGLL) (see Chapter 4, section 4.2.4, Immunoglobulin genes), the T-cell receptor 
genes at 14q11.2 (TCRA, TCRD), 7q35 (TCRB) and 7p14-p15 (TCRG) (see 
Chapter 4, section 4.2.4, T-cell receptor genes), the pregnancy-specific glycoprotein 
(PSG1, PSG2, PSG3, PSG4, PSG5, PSG6, PSG7, PSG8, PSG11, PSG12, 
PSG13) genes at chromosome 19q13.2, the histocompatibility antigen (HLA) 
genes at chromosome 6p21.3 (see Chapter 4, section 4.2.1, Genes of the major histo- 
compatibility complex) whilst the three alkaline phosphatase genes (ALPP, ALPI, 
ALPPL2) are clustered at chromosome 2q37. 

Some genes appear especially prone to duplicate, probably by virtue of their 
already being clustered in multiple copies. Thus, the carcinoembryonic antigen 
(PSG, CEA; 19q13.2) gene family has undergone multiple, but independent, mul- 
tiplication events in both the rodent and primate lineages (Rudert et al., 1989). In 
similar vein, a disproportionate fraction of mapped zinc finger gene family mem- 
bers are located on chromosome 19 (Lichter et al., 1992; Chapter 4, section 4.2.3, 
Zinc finger genes). Such clustering has probably arisen as a result of the serial 
duplication of a single ancestral gene on the same chromosome. 

Members of gene families in different species may however be amplified differ- 
entially. For example, in the human genome, three members of the formylpeptide 
receptor gene family (FPR1, FPRL1, and FPRL2) cluster at chromosome 
19q13.3, whereas in mouse, there are six Fpr genes on a syntenic region of chromosome 
17 (Gao et al., 1998). The human FPRL2 gene and four murine Fpr genes 
arose after the divergence of human and mouse. 


8.5.4 Translocation of duplicated genes 


Gene family members have often become chromosomally separated through 
translocation (Chapter 9, section 9.2) and it would appear as if the older the multi- 
gene family, the more likely this is to happen. Thus, the chromosomally unlinked 
CD36L2 (chromosome 4), CD36L1 (chromosome 12) and CD36 (7q11.2) mem- 
brane glycoprotein genes were duplicated and diverged from an ancestral gene 
prior to the separation of the arthropod and vertebrate lineages (Calvo et al., 1995). 
Other examples of ancient duplicated genes that have become separated by 
translocation are the thrombospondin (THBS1, 15q15 and THBS2, 6q27) genes 
which arose ~900 Myrs ago (Lawler et al., 1993) and the transforming growth 
factor-β (TGFB1, 19q13.2; TGFB2, 1q41; TGFB3, 14q24) genes which arose ~300 
Myrs ago (Burt and Paton, 1992). 


8.5.5 Duplication of translocated genes 


Genomic duplication may be followed by further local tandem duplication as 
exemplified by the human homeobox gene clusters at 7p14-p15 (HOXA), 17q21- 
q22 (HOXB), 12q12-q13 (HOXC) and 2q31 (HOXD) (see Chapter 4, section 4.2.1, 
Homeobox genes). The same phenomenon is exhibited by the human sodium 
channel genes many of which reside in the same paralogous chromosome segments 
as the HOX gene clusters: SCN1A, SCN2A, SCN3A, SCN6A, SCN7A and 
SCN9A (2q23-q24), SCN8A (12q13), SCN4A (17q23-q25), SCN5A and SCN10A 
(3p21-p24) (Plummer and Meisler, 1999). 

By contrast the localized duplication of translocated genes appears to be a gen- 
eral feature of the human olfactory receptor (OLFR) genes (Trask et al., 1998b). 
Thus, several OLFR gene clusters on chromosome 11 (11p15, 11p13, 11q24; 
Buettner et al., 1998) are more similar to each other than to OLFR genes on chro- 
mosome 17 (Ben-Arie et al., 1994), implying that translocation was followed by 
regional tandem duplication. However, the OLFR gene cluster on chromosome 
17p13.3 contains members that do not belong to the same subfamily, suggesting 
that it originated instead by duplication of an entire gene cluster followed by 
translocation of that cluster (Ben-Arie et al., 1994). 


8.5.6 Syntenic relationships and gene dispersal 


Gene duplication sometimes creates gene families whose members are both syn- 
tenic and dispersed, for example the human purinoceptor gene family (P2RY1, 3; 
P2RY2, 11q13.5-q14.1; P2RY4, Xq13; P2RY6, 11q13.5, P2RY7, chromosome 14; 
Somers et al., 1997). Seven human matrix metalloproteinase genes cluster at chro- 
mosome 11q22.3 (MMP1, MMP3, MMP7, MMP8, MMP10, MMP12, MMP13; 
Pendas et al., 1996), but the other family members are dispersed between many 
other chromosomes (MMP2, 16q13; MMP9, 20q11.2-q13.1; MMP11, 22q11.2; 
MMP14, 14q11-q12; MMP15, 16q21; MMP16, 8q21; MMP19, 12q14). Other 
examples include the human fucosyltransferase (FUT1, FUT2, FUT3, FUT5, 
FUT6, 19p13.3; FUT4, 11q21; FUT7, 9q34; FUT8, 14q23; Costache et al., 1997) 
and annexin genes (ANX1, 9q11-q22; ANX2, 15q21-q22; ANX3, 4q13-q22; 
ANX4, 2p13; ANX5, 4q28-q32; ANX6, 5q32-q34; ANX7, 10q21.1-q21.2; ANX8, 
10q11.2; ANX11, 10q21.1-q21.2; ANX13, 8q24.1-q24.2; Morgan et al., 1998). 
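
The distinction between the clustered and the dispersed members of a gene family can be made explicit by grouping genes according to the chromosome on which they reside. The sketch below does this for the matrix metalloproteinase genes, using only the cytogenetic locations quoted above; the parsing rule (a chromosome name followed by an arm designation) and the function names are assumptions made for illustration.

```python
import re
from collections import defaultdict

# Cytogenetic locations of the MMP genes, as quoted in the text.
MMP_LOCATIONS = {
    "MMP1": "11q22.3", "MMP3": "11q22.3", "MMP7": "11q22.3",
    "MMP8": "11q22.3", "MMP10": "11q22.3", "MMP12": "11q22.3",
    "MMP13": "11q22.3",
    "MMP2": "16q13", "MMP9": "20q11.2-q13.1", "MMP11": "22q11.2",
    "MMP14": "14q11-q12", "MMP15": "16q21", "MMP16": "8q21",
    "MMP19": "12q14",
}


def chromosome_of(band: str) -> str:
    """Extract the chromosome name from a band such as '11q22.3' or 'Xq13'."""
    match = re.match(r"(\d{1,2}|X|Y)[pq]", band)
    if match is None:
        raise ValueError(f"cannot parse cytogenetic location: {band!r}")
    return match.group(1)


def group_by_chromosome(locations: dict) -> dict:
    """Map each chromosome to the list of genes located on it."""
    groups = defaultdict(list)
    for gene, band in locations.items():
        groups[chromosome_of(band)].append(gene)
    return dict(groups)


by_chrom = group_by_chromosome(MMP_LOCATIONS)
for chrom, genes in sorted(by_chrom.items(), key=lambda kv: -len(kv[1])):
    print(chrom, genes)
```

Genes sharing a chromosome are syntenic by definition, but only those sharing a band (here 11q22.3) form a cluster; MMP2 and MMP15 are syntenic on chromosome 16 without being assigned to the same band.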

In some cases, evolutionary conservation of post-duplicational clustering may 
be important for the coordinate regulation of individual genes by common con- 
trol elements (see Chapter 5, section 5.1.14). Possible examples of this include the 
spermatid-specific protamine (PRM1 and PRM2; 16p13.2) genes (Nelson and 
Krawetz, 1994), the platelet membrane glycoprotein (ITGA2B and ITGA3; 
17q21-q22) genes (Bray et al., 1988), the albumin gene family (ALB, AFP, AFM, 
GC; 4q11-q13; Nishio et al., 1996), the pregnancy-specific glycoprotein (PSG1, 
PSG2, PSG3, PSG4, PSG5, PSG6, PSG7, PSG8, PSG11, PSG12, PSG13; 
19q13.2) genes (Khan et al., 1992) and the fibrinogen α, β and γ (FGA, FGB and 
FGG; 4q31) genes (Fu et al., 1992; Roy et al., 1994). 

Individually duplicated genes may still exhibit synteny simply because recom- 
bination has not yet separated them. Thus, the paralogous pulmonary surfactant 
protein D (SFTPD) gene at 10q22-q23 lies in very close proximity to the pulmonary 
surfactant protein A (SFTPA1) gene (Kölble et al., 1993). Similarly, the 
paralogous interferon α/β receptor gene (IFNAR1) is closely linked to the 
cytokine receptor B4 (CRFB4) gene on chromosome 21q22.1 (Lutfalla et al., 
1995). Synteny may however be retained for very long periods of evolutionary 
time. Indeed, the human proteasome α2 subunit (PSMA2) and TATA box-binding 
protein (TBP) genes are linked on chromosome 6q27 and their orthologues 
are also syntenic in Drosophila melanogaster and Caenorhabditis elegans (Trachtulec, 
1997). Similarly, conserved synteny is also apparent between the human genes 
PIM1, RXRB, PBX2, NOTCH4 and TNXA at 6p21 and their orthologues in D. 
melanogaster and C. elegans (Trachtulec, 1997). The evolutionary conservation of 
close linkage over such long time periods is highly unusual and implies the exis- 
tence of underlying functional reasons. 


8.5.7 Functional redundancy and post-duplication diversification 


Evolution is opportunistic and gene duplication provides the opportunity for 
structural and functional diversification. This is well exemplified by the origin of 
the α-lactalbumin gene (LALBA; 12q13). α-Lactalbumin may be regarded as 
essentially a mammalian ‘invention’. It is a regulatory subunit of the enzyme lactose 
synthetase. Phylogenetic analysis has indicated that α-lactalbumin evolved 
from lysozyme (LYZ; chromosome 12) (a bacteriolytic enzyme present in both 
vertebrates and invertebrates) through a gene duplication event that occurred 
before the divergence of mammals and birds, and prior to the evolution of the 
mammary gland (Prager and Wilson, 1988). The origin of the LALBA gene thus 
appears to have long preceded the acquisition of its modern function. Its rapid 
evolution in the mammalian lineage need not however have been solely due to its 
acquisition of a new function. Once its lysozyme activity was lost, many of the 
amino acids essential for the protein’s original function were freed from selective 
pressure, allowing rapid change, not all of it necessarily adaptive (Nitta and Sugai, 
1989; see also discussion on function switching in Chapter 7, section 7.1.3). 

The protein products of a gene duplication may sometimes be targeted to dif- 
ferent cellular locations. The human mitochondrial HMG CoA synthase 
(HMGCS2; 1p12-p13) gene encodes an enzyme which catalyses the first reaction 
of ketogenesis whilst the cytoplasmic form of the enzyme, encoded by the 
HMGCS1 gene (5p13), catalyses the second step in cholesterol biosynthesis from 
acetyl CoA. The two genes are thought to have arisen from a gene duplication 
~500 Myrs ago (Boukaftane et al., 1994). The emergence of the two enzymes as 
distinct entities served to link the pathways of β-oxidation and leucine catabolism 
and created the HMG CoA pathway of ketogenesis thereby providing a lipid- 
derived energy source. 

Gene duplication has also potentiated the emergence of isozymes, enzymes that 
catalyze the same biochemical reaction but which may differ from each other in 
terms of tissue specificity, developmental regulation or biochemical properties. 
Thus, in vertebrates, the two subunits of lactate dehydrogenase are encoded by 
two genes (LDHA; 11p14-p15 and LDHB; 12p12) and these subunits can be
combined in such a way as to form five tetrameric isozymes (A₄, A₃B, A₂B₂, AB₃, and
B₄) each with its own distinctive properties.
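
The figure of five isozymes follows from simple combinatorics: a tetramer assembled from two subunit types, with the order of subunits ignored, can have only 4 + 1 distinct compositions. A minimal Python sketch enumerates them; the function name `tetramer_isozymes` is illustrative and is not drawn from any library.

```python
from itertools import combinations_with_replacement


def tetramer_isozymes(subunits=("A", "B"), size=4):
    """Enumerate subunit compositions of an oligomer, ignoring subunit order."""
    labels = []
    for combo in combinations_with_replacement(subunits, size):
        # Collapse e.g. ('A', 'A', 'A', 'B') to the label 'A3B1'.
        label = "".join(
            f"{s}{combo.count(s)}" for s in subunits if combo.count(s)
        )
        labels.append(label)
    return labels


isozymes = tetramer_isozymes()
print(len(isozymes), isozymes)
```

With a hypothetical third subunit type, the same enumeration would yield 15 compositions, which suggests how even a single further duplication could in principle enlarge an isozyme repertoire considerably.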

Iwabe et al. (1996) examined gene duplications during organismal evolution 
and concluded that most gene duplications giving rise to entirely novel functions 
predated the divergence of the vertebrate and arthropod lineages. Genes which 
encode proteins that are localized to cell compartments (compartmentalized iso- 
forms) emerged by duplications which predated the separation of animals and 
fungi. By contrast, genes encoding products with virtually identical functions but 
differing tissue distribution (tissue-specific isoforms) have undergone
duplications independently in vertebrates and arthropods after divergence of the
vertebrate and arthropod lineages. Iwabe et al. (1996) concluded that there was a good
correspondence between molecular evolution at the level of the gene, and tissue 
and organismal evolution. 

Several gene duplications have occurred during primate evolution. For
example, the tandem duplication responsible for the creation of the γ1- (HBG1) and γ2-
(HBG2) globin genes occurred prior to the divergence of Old World from New
World monkeys (Fitch et al., 1991). The 5.5 kb duplicated segment is bounded by 
two related LINE elements suggesting that the duplication occurred via homolo- 
gous unequal recombination (Fitch et al., 1991). (The role of repetitive DNA 
sequences in mediating the recombinational events responsible for gene duplica- 
tions is discussed in more detail in Chapter 9, section 9.4.1). Perhaps it was the 
functional redundancy initially introduced by the duplication which created an 
opportunity for the γ-globin genes to evolve a fetal function to replace their
original embryonic function. Post-duplicational functional redundancy is
still apparent in some systems. For instance, the MyoD family of transcription
factors involved in myogenesis in skeletal muscle still exhibits functional
redundancy (between the proteins encoded by the MYOD1 (11p15.1), MYF5
(chromosome 12), and MYF6 (chromosome 12) genes) and this is probably indicative of
overlapping functions (Atchley et al., 1994). 

The primate glycophorin genes (GYPA, GYPB, GYPE; 4q28-q31) are thought
to have emerged by duplication, mediated perhaps by recombination between Alu 
sequences (Rearden et al., 1993). GYPA probably represents the ancestral gene 
and is present in all primates studied. The GYPB gene is present in human,
chimpanzee and gorilla but not orangutan and gibbon, whereas the GYPE gene is
present in human and chimpanzee but, intriguingly, in only 7/16 gorillas tested (Rearden
et al., 1993). The complement C4 (C4A, C4B; 6p21.3) genes and the cytochrome 
CYP21 (6p21.3) gene also emerged in the primate lineage as a result of a single 
duplication occurring prior to the divergence of the apes from the Old World 
monkeys (Horiuchi et al., 1993). 

There are several other examples of gene duplications occurring in only one 
mammalian order. For example, the bovine-specific conglutinin (Cgn1) gene is
homologous to the human pulmonary surfactant protein D (SFTPD; 10q22-q23) 
gene and is thought to have evolved by gene duplication in the Bovidae after their 
divergence from the other mammals (Liou et al., 1994). Sometimes the number 
and distribution of Alu sequences may help the reconstruction of the phylogeny 
of a gene duplication event as in the case of the duplication of the apolipoprotein 
CI (APOC1; 19q13.2) gene, timed at about 39 Myrs ago (Raisonnier, 1991).

Some partially homologous genes may have undergone a process of duplication 
and divergence but one or other copy may have either acquired or lost DNA 
sequence at some stage thereby limiting the extent of observable homology 
between them. One example of this is the human α-N-acetylgalactosaminidase
(NAGA; 22qter) and α-galactosidase A (GLA; Xq) genes (Wang and Desnick,
1991). Six of the seven GLA exons were identically positioned in the NAGA gene 
but there was no similarity between the predicted amino acid sequences of GLA 
exon 7 and NAGA exons 8 and 9. 


8.5.8 Truncated gene copies


In some cases, only a portion of a gene may be involved in these intra-chromoso- 
mal duplication events [e.g. the polycystic kidney disease 1 (PKD1; 16p13.3) 
gene; Hughes et al., 1995]. This often results, as in the PKD1 example cited, in the 
generation of a linked pseudogene representing a partial copy of the parental 
source gene (for other examples, see Chapter 6, section 6.1.1). However, truncated 
copies need not always be inactive. Thus, the melanin-concentrating hormone 
(PMCH; 12q23-q24) gene has become partially duplicated during primate
evolution to generate two truncated copies (PMCHL1 and PMCHL2) which have been
translocated to chromosome 5p14 and 5q12-q13 respectively (Viale et al., 1998). 
These variant gene copies possess open reading frames, are expressed in a distinc- 
tive tissue-specific fashion and, in the authors’ opinion, may represent ‘genes in 
search of a function’. 


8.5.9 Duplicational polymorphisms 


Some gene duplications occur as polymorphic variants in the human population, 
for example the red (RCP; Xq28) and green (GCP; Xq28) visual pigment genes (Figure
9.6; Neitz et al., 1995; Neitz and Neitz, 1995). Human males with
trichromatic vision typically possess one RCP gene, one, two, or more GCP genes and an
RCP/GCP hybrid gene (Figure 9.6; Drummond-Borg et al., 1989; Nathans et al., 
1986a; 1986b; Neitz and Neitz, 1995; Neitz et al., 1995). However, some males 
with apparently normal color vision can possess 4 RCP genes and 6 or 7 GCP 
genes (Neitz and Neitz, 1995). This notwithstanding, only one GCP gene is nor- 
mally expressed, probably as a result of the activity of a locus control region 
upstream of the RCP gene (Winderickx et al., 1992). 

Other examples of duplicational polymorphisms include the α1-globin (HBA1;
16p13.3; Lie-Injo et al., 1981), ζ-globin (HBZ; 16p13.3; Winichagoon et al., 1982),
Gγ-globin (HBG2; 11p15; Thein et al., 1984), haptoglobin (HP; 16q22.1; Maeda
et al., 1986) and proline-rich protein (PRB1, PRB2, PRB3, PRB4; 12p13.2;
Lyons et al., 1988) genes. Duplicational polymorphisms may even manifest as 
gene cluster copy number variation, for example that involving the CYP21,
CYP21P, C4A and C4B genes at human chromosome 6p21.3 (Collier et al., 1989;
Figueroa, 1997; Figure 8.10). Another example of a duplicational polymorphism is
that of three members (one potentially functional) of the olfactory receptor
(OLFR) gene family which occur in tandem within a block that is duplicated at 14
different subtelomeric locations in the human genome (Trask et al., 1998a). This
results in normal individuals possessing between 7 and 11 copies of this block in
their genomes. Trask et al. (1998a) suggested that subtelomeric regions could
serve as ‘nurseries’ for the generation of diversity by promoting gene duplication.

Figure 8.10. Common haplotypes of genes present on human chromosome 6p21.3 (after
Figueroa, 1997). CYP21, open square; CYP21P, solid square; C4B, stippled square;
C4A, hatched square.

Other gross duplicational polymorphisms are found in the immunoglobulin VH
gene cluster (IGHV; 14q32): one of 50 kb in length is present in 73% of
individuals and results in the gain of 5 functional VH segments (Walter et al., 1993;
Willems van Dijk et al., 1992) whilst another of length ~80 kb is present in ~50%
of individuals and involves the gain/loss of two functional VH segments (Cook et
al., 1994). A third such polymorphism involving only a single additional VH
segment has been reported to occur in 27% of individuals (Cook and Tomlinson,
1995). A duplicational polymorphism is also apparent in the CH gene cluster: the
IGHG4 gene (14q32) is duplicated in 44% of haplotypes (Brusco et al., 1997).
Finally, the partial duplication of the GABA-A receptor α5 (GABRA5) gene is
polymorphic in that individuals differ with respect to gene copy number (Ritchie 
et al., 1998). 


8.6 Intragenic gene duplications in evolution 


Numerous examples of human genes have now come to light in which the 
encoded proteins have emerged through the introduction of individual exons or 
blocks of exons. Some of these exons encode specific protein domains (e.g. zinc 
finger, homeobox, immunoglobulin-like, epidermal growth factor-like, 
fibronectin, ABC cassette, Sushi, ankyrin, chymotrypsin etc.; Doolittle, 1995;
Henikoff et al., 1997) which have come to be distributed between a large number 
of different proteins through exon shuffling (Chapter 3, section 3.6.1). Protein evo- 
lution may however also involve the internal duplication or amplification of indi- 
vidual exons, blocks of exons or alternatively repetitive sequence motifs within 
exons. Some typical examples of such intragenic duplication events are explored 
below. 


8.6.1 Multi-exon duplications


Some genes encode proteins that comprise two homologous domains and are 
therefore likely to have originated through a gross internal duplication. Thus, the 
17 exon human transferrin (TF; 3q21) gene is thought to have originated by an 
internal duplication which may have resulted from an unequal crossing over event 
(Park et al., 1985; Figure 8.11). Similarly, the human angiotensin I converting 
enzyme is encoded by a gene (DCP1; 17q23) which comprises 26 exons that 
encode two homologous domains each containing an active site. Exon number and 
size in the DCP1 gene are consistent with an ancient internal duplication, as are
the codon phases at the exon-intron boundaries (Hubert et al., 1991). An ancient
internal duplication probably also occurred in the cystic fibrosis transmembrane
conductance regulator (CFTR; 7q31.3) gene which encodes a protein with two
transmembrane domains and two nucleotide binding fold domains (Hughes 1994).
By contrast, successive duplications of ancestral domains appear to have occurred
in the human kininogen (KNG; 3q27) gene (Kellermann et al., 1986). Other
examples of human proteins displaying internal domain duplication include calbindin
(six 43 amino acid repeats), fibronectin (twelve 40 amino acid repeats),
plasminogen (five 79 amino acid repeats) and α-tropomyosin (seven 42 amino acid
repeats) (Li 1997). Many other genes are also likely to have evolved by internal gene
duplication but the duplicated regions have probably diverged so much over
evolutionary time that sequence homology between them is no longer discernible.

Figure 8.11. A possible scheme for the evolution of the human transferrin (TF) gene
(after Park et al., 1985). The ancestor of the transferrin gene was duplicated by an
intragenic crossing over event (A) generating an internally duplicated gene which had
lost one of its leader peptide coding exons and one of its terminal 3’ exons (B). During
evolution, exon 4 (short arrow) was deleted. In C, the upper numbers correspond to the
exon numbering in A and B whilst the lower ones correspond to the exons of the extant
human transferrin gene.


8.6.2 Exon duplication 


It has been estimated that at least 6% of exons in human genes have arisen by the 
duplication of pre-existing exons (Fedorov et al., 1998). One example is that of the 
human CHC1 (1p36.1) gene, which encodes a protein involved in the coupling
between DNA replication and mitosis. This gene comprises 14 exons, eight of 
which encode the 7 tandemly repeated domains of ~60 amino acids within the 
CHC1 protein; each repeat is encoded by a single exon except for repeat IV which 
is encoded by exons 10 and 11 separated by an inserted intron (Furuno et al., 1991; 
Figure 8.12). The CHC1 gene therefore appears to have arisen through the ampli- 
fication of a primordial exon (Furuno et al., 1991). Gene construction by exon 
amplification has also been employed in the macrophage mannose receptor 
(MRC1; 10p13; Kim et al., 1992) gene; 26 of the 30 exons of the MRC1 gene serve
to encode the eight C-type carbohydrate recognition domains. Ceruloplasmin
(CP; 3q23-q25) consists of three homologous repeat units and appears to have
evolved by successive duplication of a primordial copper-binding protein domain
of ~350 amino acid residues (Ortel et al., 1984). Finally, the human annexin II
(ANX2; 15q21-q22) gene encodes a protein containing 8 copies of a conserved 70
amino acid repeat (Spano et al., 1990).

Figure 8.12. Structure and evolution of the human CHC1 gene (after Furuno et al., 1991).

Exon duplication/amplification is clearly a common mechanism of evolution- 
ary change and its occurrence may often be inferred from studies of exon/intron 
distribution or from the repetitivity of protein domains (Chapter 3, section 3.6). 
Thus, the six paralogous genes of the human salivary proline-rich protein family, 
closely linked on chromosome 12p13.2, differ from each other in terms of the 
number of copies of a 63 bp tandem repeat in exon 3 of each gene: PRH1 and
PRH2 (6 repeats), PRB1 and PRB3 (15 repeats), PRB2 (16 repeats) and PRB4
(11 repeats) (Azen et al., 1987; Kim et al., 1993). Examples of orthologous gene 
pairs in which only one orthologue manifests exon duplication/amplification are 
somewhat rarer. The oxidized low density lipoprotein receptor 1 (OLR1; 12p12.3-
p13.1) gene serves to illustrate the principle; the rat Olr1 gene encodes three 46
amino acid repeats between the transmembrane and lectin-like domains of the
LOX-1 protein, whilst the human and bovine OLR1 genes encode only one such
repeat (Nagase et al., 1998). 

The primate semenogelin genes (SEMG1, SEMG2) which, in human, are 
closely linked on 20q12-q13.1, encode the major protein constituents of the sem- 
inal fluid. These genes differ however between species as a result of internal dupli- 
cations of ~180 bp segments encoding 60 amino acid repeats. Human 
semenogelin II contains two fewer repeats than rhesus monkey semenogelin II 
which appears to contain two fewer repeats than baboon semenogelin II
(Ulvsback and Lundwall, 1997). The primate semenogelin genes arose by dupli- 
cation of an ancestral gene about 60 Myrs ago (Lundwall, 1996) and the human 
semenogelin I gene appears to have lost two 60 amino acid repeats by comparison 
with the paralogous semenogelin II gene (Ulvsback and Lundwall, 1997). The 
semenogelin I protein from cotton-top tamarin (Saguinus oedipus) possesses three 
additional repeats as compared to human and, intriguingly, possesses 14 potential 
glycosylation sites not present in the human protein (Lundwall, 1998). 

Most known genes manifesting internal duplication exhibit strong conserva- 
tion of intron positions between the duplicated domains (Chapter 3, sections 3.1 
and 3.6.2). One notable exception, however, is the rabbit phosphofructokinase 
gene (Lee et al., 1987). Sequence homologies between bacterial and rabbit phos- 
phofructokinases and between the amino and carboxy terminal ends of the rabbit 
enzyme are consistent with the origin of this gene being via a process of internal 
duplication and divergence. However, intron positions are not conserved even 
between the two halves of the rabbit gene. 

Exon duplication may potentiate alternative splicing. Thus, the human keto- 
hexokinase (KHK; 2p23) gene contains two very similar 135 bp exons (termed 3a 
and 3c) which are mutually exclusively spliced into KHK mRNA (Hayward and 
Bonthron, 1998). Both the exon-intron structure and the pattern of alternative 
splicing are conserved between human, rat and mouse, consistent with the exis- 
tence of two evolutionarily conserved KHK isoforms. This alternative splicing 
event is also tissue-specific since, in both rat and human, those tissues that 
express high levels of KHK incorporate exon 3c whereas other tissues incorporate 
exon 3a. Interestingly, a shift in splicing choice from exon 3a to 3c appears to 
occur during development between the human fetus and adult (Hayward and 
Bonthron, 1998). 


8.6.3 Intra-exonic duplications 


Internal gene duplications may sometimes be quite subtle. Thus, the 3’ untrans- 
lated regions of the human and monkey cytochrome c oxidase subunit II 
(MTCO2; mitochondrial genome) genes have been generated by duplication 
events involving a 13 bp region that occurred during primate evolution 
(Ramharack and Deeley, 1987). The RNA derived from the MTCO2 gene has the 
potential to form stable stem-loop structures in a region immediately preceding 
the duplication site and these inverted repeat sequences may have played a role in 
promoting the duplicational events. 

In some cases, homogenization of internal repeats has occurred as for example 
with the complement control protein repeats of the baboon and human
complement receptor type 1 (CR1; 1q32; Clemenza et al., 1997; Hourcade et al., 1990)
genes. This process has been termed horizontal or concerted evolution and the mech- 
anism involved is likely to be either unequal crossing over or gene conversion. 
Whichever mechanism is responsible, the following example suggests that intra- 
genic homogenization may be a less efficient process than intergenic homoge- 
nization. The human small proline-rich (SPRR) proteins, induced during 
keratinocyte differentiation, are encoded by two SPRR1 genes (A and B),
approximately seven SPRR2 genes and a single SPRR3 gene, closely linked on
chromosome 1q21-q22 (Gibbs et al., 1993). The central segments of the encoded
polypeptides are composed of tandemly repeated units of either eight or nine amino
acids. Thus the consensus octamer PKVPEPCH is found 6 times in SPRR1A and
SPRR1B, the nonamer PKCPEPCPP three times in SPRR2 and the octamer
TKVPEPGC 14 times in SPRR3. This is consistent with a process of internal 
duplication that began after the divergence of the SPRR genes into three distinct 
subfamilies. It is evident that, during the evolution of the SPRR gene family, 
there has been a bias toward either intragenic or intergenic duplications. Thus in 
SPRR2, intergenic recombination has occurred more frequently than intragenic 
recombination since there are seven SPRR2 genes each with three repeats. By 
contrast, in SPRR3, no intergenic recombination has occurred but intragenic 
recombination has occurred frequently to generate 14 repeat copies. Interestingly, 
the percentage of amino acid conservation (relative to the consensus for each type 
of gene) is significantly higher for the SPRR2 repeats than for either SPRR1 or 
SPRR3 suggesting that intergenic homogenization may be more effective than 
intragenic homogenization. 
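
The comparison of repeat conservation described above can be sketched in a few lines of Python. The consensus sequences are those quoted in the text, but the individual repeat units used here are hypothetical and serve only to exercise the calculation; the function name `percent_conservation` is likewise illustrative, and the published analysis may well have been performed differently.

```python
def percent_conservation(repeats, consensus):
    """Percentage of residues across all repeats that match the consensus."""
    if not repeats:
        raise ValueError("need at least one repeat")
    if any(len(r) != len(consensus) for r in repeats):
        raise ValueError("every repeat must have the consensus length")
    matches = sum(a == b for r in repeats for a, b in zip(r, consensus))
    return 100.0 * matches / (len(repeats) * len(consensus))


# Consensus sequences as quoted in the text; the repeat units listed
# below are HYPOTHETICAL examples, not real SPRR sequences.
sprr2_like = ["PKCPEPCPP", "PKCPEPCPP", "PKCPQPCPP"]
sprr3_like = ["TKVPEPGC", "TKVPEPGY", "TQVPEPGC", "TKVTEPGC"]

sprr2_score = percent_conservation(sprr2_like, "PKCPEPCPP")
sprr3_score = percent_conservation(sprr3_like, "TKVPEPGC")
print(f"SPRR2-like: {sprr2_score:.1f}%  SPRR3-like: {sprr3_score:.1f}%")
```

A higher score for one subfamily than for another, as in this toy comparison, is the kind of observation that would be taken to indicate more effective homogenization.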

Concerted evolution is not however an obligatory property of internally
repetitive proteins. Take for example the case of the α- and β-spectrins. Spectrin is a
red blood cell cytoskeletal component which consists of a tetramer of two antiparallel
αβ spectrin dimers. The α-spectrin (SPTA1; 1q21) and β-spectrin (SPTB; 14q22-
q23) genes evolved by duplication of a common ancestral gene which existed
before the divergence of the vertebrate and arthropod lineages ~600 Myrs ago.
The structure of both protein subunits is consistent with successive intragenic
duplications. The α- and β-subunits consist of tandemly repeated segments of 106
amino acids of which 20 occur in α-spectrin, 17 in β-spectrin. Although the
α-spectrin segments appear to have evolved in homogeneous fashion, the β-spectrin
segments exhibit considerable heterogeneity (Muse et al., 1997; Thomas et al.,
1997). One explanation for this difference is that some segments with specific
functions may have evolved differently from others. Indeed, on the basis of the
similar locations of the heterogeneous α- and β-spectrin segments, Muse et al.
(1997) suggested that the α- and β-spectrins have co-evolved, and those segments
that are intimately involved in subunit dimerization have been evolutionarily 
constrained by the structures of their binding partners. Muse et al. (1997) found 
no evidence for interrepeat exchanges and therefore concluded that neither gene 
conversion nor recombination had operated. At some stage, probably before the 
divergence of arthropods and vertebrates, concerted evolution may have operated 
but this initial phase probably ceased as the individual segments began to diverge 
at the DNA level (Thomas et al., 1997). 


8.6.4 The emergence of primordial genes by oligomer duplication 


It is possible that primordial coding sequences emerged by a process of sequential 
duplication of base oligomers to yield ‘genes’ encoding polypeptides with signifi- 
cant periodicity (Ohno, 1984; Trifonov and Bettecken, 1997). Ohno (1987) cites 
the example of a putative primordial rhodopsin gene, which gave rise to this 
ancient family of seven transmembrane domain-containing proteins including 
among others, bacterial rhodopsin and vertebrate retinal opsin, β-adrenergic
receptor and muscarinic acetylcholine receptor. This primordial gene originally 
contained CCTGCTG, CCTGGCC and GCTGGCC heptameric repeats. Ohno
(1987) showed that the gene encoding porcine muscarinic acetylcholine receptor 
still contains many of these oligomeric repeats. The original heptameric repeats 
appear to be more stringently conserved in those portions of the gene encoding 
the seven transmembrane domains whereas new repeat units comingle with old 
repeats in those portions that encode the extracellular and intra-cytoplasmic 
domains. 
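
The persistence of such oligomeric repeats in a modern coding sequence can be examined by simply counting their occurrences. The following Python sketch uses the three heptamers named above; the test sequence is a hypothetical concatenation constructed for illustration and is not the porcine receptor sequence, and `count_heptamers` is an illustrative name.

```python
HEPTAMERS = ("CCTGCTG", "CCTGGCC", "GCTGGCC")


def count_heptamers(seq, motifs=HEPTAMERS):
    """Count occurrences (overlaps included) of each 7-base motif in seq."""
    seq = seq.upper()
    counts = {m: 0 for m in motifs}
    # Slide a 7-base window along the sequence one base at a time.
    for i in range(len(seq) - 6):
        window = seq[i:i + 7]
        if window in counts:
            counts[window] += 1
    return counts


# HYPOTHETICAL sequence built from the three heptamers, for illustration only.
toy_seq = "CCTGCTG" + "CCTGGCC" + "GCTGGCC" + "CCTGCTG"
heptamer_counts = count_heptamers(toy_seq)
print(heptamer_counts)
```

Applied separately to the segments encoding transmembrane and non-transmembrane domains, counts of this kind would allow the relative conservation of the original repeats to be compared between regions.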


8.6.5 Intragenic duplicational polymorphisms 


Partial internal duplications of genes can also occur as polymorphic variants. A 
particularly dramatic example is provided by the human apolipoprotein(a) (LPA; 
6q27) gene. This gene possesses at least 34 different alleles containing a variable 
number (between 12 and 51) of tandemly repeated kringle IV-encoding domains. 
Each allele gives rise to a different apolipoprotein(a) isoform thereby explaining 
the size polymorphism of the protein: 300-800 kDa (Lackner et al., 1993; 
reviewed by Scanu and Edelstein, 1995). The LPA gene has been described as a 
‘plasminogen gene gone awry’; an apt description in that the closely linked and 
evolutionarily related plasminogen (PLG; 6q27) gene encodes a protein with only 
5 kringles (Ichinose, 1992; Magnaghi et al., 1994). The LPA gene is thought to 
have arisen between 40 Myrs (McLean et al., 1987) and 90 Myrs ago (Pesole et al., 
1994) during the adaptive radiation of the mammals. At least ten kringle IV- 
encoding domains are present in rhesus macaque suggesting that kringle number 
may have expanded progressively during primate evolution (Pesole et al., 1994). 
Although the LPA gene was originally thought to be confined exclusively to pri- 
mates, it has also been found in hedgehogs in which it possesses 31 kringle-encod- 
ing domains (Lawn et al., 1997). Since these are kringle III repeats, we must 
surmise that an apolipoprotein(a)-like gene arose independently from a plas- 
minogen-like gene in this insectivore (~80 Myrs ago) and experienced a similar 
expansion of a subset of its kringle repeats to that found in humans — a remarkable 
example of convergent evolution. In humans, the LPA kringle repeat number poly- 
morphism may not be without clinical significance because (i) there is an inverse 
relationship between apolipoprotein(a) isoform size and plasma apolipoprotein(a) 
levels (van der Hoek et al., 1993) and (ii) elevated levels of apolipoprotein(a) are 
associated with an increased risk of atherosclerosis and cardiovascular disease 
(Byrne and Lawn, 1994). 

Several human mucin genes exhibit highly polymorphic tandemly repetitive 
regions, for example MUC1 (1q21; Gengler et al., 1990), MUC2 (11p15; Toribara
et al., 1991), and MUC4 (3q29; Nollet et al., 1998). Intragenic repeat copy number 
polymorphisms are also evident in the human proline-rich protein (PRH1; 
12p13.2) gene where alleles vary in terms of the number of copies of a 63 bp repeat 
in exon 3 (Azen et al., 1987; Kim et al., 1993) and the complement receptor type 1 
(CR1; 1q32) gene where the two most common alleles differ in terms of the pres- 
ence of a long homologous repeat of ~450 amino acids (Wong et al., 1989). Finally, 
in the human filaggrin (FLG; 1q21) gene, a length polymorphism of 10, 11, or 12 
copies of a 972 bp repeat occurs within its polyprotein precursor which is subse- 
quently cleaved into individual functional filaggrin molecules (Gan et al., 1991). 
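
Alleles of this kind differ only in the number of tandem copies of a repeat unit, so that genotyping reduces in principle to counting the longest uninterrupted run of that unit. A minimal Python sketch follows, assuming exact (undiverged) copies; the four-base unit is a hypothetical stand-in for a real repeat such as the 972 bp filaggrin unit, and `max_tandem_copies` is an illustrative name.

```python
import re


def max_tandem_copies(seq, unit):
    """Length, in copies, of the longest run of exact tandem copies of unit."""
    # The lookahead tests every start position, so that runs beginning in
    # any phase are found; the inner group greedily extends each run.
    pattern = re.compile(f"(?=((?:{re.escape(unit)})+))")
    runs = pattern.findall(seq)
    return max((len(run) // len(unit) for run in runs), default=0)


# HYPOTHETICAL alleles carrying 10 and 12 copies of a toy repeat unit.
unit = "CAGT"
allele_10 = "GG" + unit * 10 + "TT"
allele_12 = "GG" + unit * 12 + "TT"
print(max_tandem_copies(allele_10, unit), max_tandem_copies(allele_12, unit))
```

Real repeat units usually diverge from one another to some degree, so a practical analysis would need to tolerate mismatches rather than require exact copies.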


8.7 Coding sequence expansion and contraction resulting
from the introduction or removal of initiation and
termination codons


The mutation of initiation codons leading to the extension of the protein coding 
sequence of a gene is known to be a cause, albeit an infrequent one, of human 
genetic disease (Cooper and Krawczak, 1993). Such mutations have however also 
occurred on a number of occasions during gene evolution. Thus, an ATG→GTG
substitution occurred in the Met initiator codon of the ZNF80 (3q13.3) gene in 
the common ancestor of African green monkey and rhesus macaque which 
resulted in a change in the site of translational initiation to a location 20 codons 
amino terminal to the Met initiator codon used by humans, chimpanzees and 
gorillas (Di Cristofano et al., 1995). However, in African green monkey, the 
ZNF80 protein is only 213 amino acid residues in length (as compared to 273 
residues in humans and the great apes and 293 residues in rhesus macaque) owing 
to truncation of the protein due to an additional GAG→TAG substitution
introducing a novel termination codon (Di Cristofano et al., 1995).

Another example of mutation resulting in the species-specific use of alternative 
initiation codons is provided by an ATG→ATA transition in the human hepatic
peroxisomal L-alanine: glyoxylate aminotransferase 1 (AGXT; 2q36-q37) gene 
(Takada et al., 1990). This lesion removed the original initiation codon and an 
alternative downstream initiation codon is used in the translation of the human 
AGXT transcript. The rat Agxt orthologue encodes a protein which, by compari- 
son with the human protein, possesses an extra 22 amino terminal amino acids 
specifying a leader sequence thought to contain a mitochondrial targeting signal 
that is absent in the human protein. 
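
The consequence of losing an initiation codon can be illustrated with a toy model in which translation begins at the first ATG encountered and proceeds to the first in-frame termination codon. In the Python sketch below, both sequences are hypothetical and far shorter than any real transcript, and the function names are illustrative; the model ignores sequence context around the initiation codon.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def orf_codons(seq):
    """Codons translated from the first ATG up to the first in-frame stop."""
    start = seq.find("ATG")
    if start == -1:
        return 0
    n = 0
    for i in range(start, len(seq) - 2, 3):
        if seq[i:i + 3] in STOP_CODONS:
            break
        n += 1
    return n


def substitute(seq, position, base):
    """Return seq with the base at the given 0-based position replaced."""
    return seq[:position] + base + seq[position + 1:]


# HYPOTHETICAL transcript: upstream ATG, a 3-codon leader, then a second
# in-frame ATG followed by two codons and a stop.
ancestral = "GCC" + "ATG" + "GCTGCAGGC" + "ATG" + "AAAGGG" + "TAA"
derived = substitute(ancestral, 5, "A")  # ATG -> ATA at the upstream codon

print(orf_codons(ancestral), orf_codons(derived))
```

In this toy case the derived sequence initiates at the downstream ATG and its product lacks the four amino-terminal residues of the ancestral product, in the same way that the human AGXT protein lacks the 22-residue leader retained in the rat.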

The mutational removal of a termination codon may also lead to the elongation 
of a protein product. One example of this is provided by the porcine P-glycopro- 
tein genes whose coding regions may be seen to have been extended by such a 
mutation when compared to the human orthologous proteins (PGY1, PGY3;
7q21; Childs and Ling, 1996). 


8.8 Minisatellites, microsatellites and telomeric repeats 


Repetition is the only form of permanence that nature can achieve. 
George Santayana (1922) 


A detailed discussion of the evolution of the various families of satellite, min- 
isatellite and microsatellite DNA would be somewhat tangential to the remit of 
this volume and would merely serve to recapitulate a previous volume in this 
series. However, a brief résumé will be given to provide the necessary background
to guide the interested reader to the relevant literature. 


8.8.1 Minisatellite DNA sequences 


Minisatellite DNA sequences occur in all eukaryotes from yeast to human (Haber 
and Louis, 1998). Minisatellites frequently exhibit substantial allelic variability
with respect to repeat number (Jeffreys et al., 1990; reviewed by Armour, 1996) 
and allele length analysis has demonstrated germline mutation rates as high as 
15% per gamete (Jeffreys et al., 1988; Jeffreys, 1997). Minisatellite mutation may 
involve intra-allelic rearrangements whose frequency, unlike inter-allelic 
rearrangements, is influenced by the size of the tandem array (Buard et al., 1998). 
Sequence similarities, manifested by a subset of minisatellites, to the Chi recom- 
bination promoting element of E. coli have led to the suggestion that this ‘core 
sequence’ might be recombinogenic and could serve to promote unequal crossing 
over. However, analysis of flanking polymorphisms has not indicated the 
exchange of markers predicted for the products of unequal exchange between alle- 
les (Wolff et al., 1988, 1989). This notwithstanding, recombination hotspots some- 
times co-localize with minisatellites (Jeffreys et al., 1998a) which has led to the 
suggestion that minisatellite instability may be a by-product of meiotic recombi- 
nation (Jeffreys et al., 1998b). 

Monckton et al. (1994) have shown that minisatellite mutation can involve 
complex inter-allelic gene conversion events. These may exhibit polarity since the 
gain of a few repeat units appears to be confined to one end of the tandem repeat 
array (Jeffreys et al., 1994). One alternative proposal is that minisatellite mutation 
may involve an array homogenization process which could operate by biased 
repair of intra-helical (slippage) or inter-helical (unequal sister chromatid 
exchange) heteroduplexes (Bouzekri et al., 1998). Whatever the mechanism(s) 
underlying their allelic variability, the rapid evolution of minisatellites appears
to have rendered them an important substrate for the opportunistic processes of 
molecular evolution. Indeed, several are known to have become recruited as gene 
regulatory elements (see Chapter 5, section 5.1.12, Minisatellites and microsatellites). 


8.8.2 Microsatellite DNA sequences 


Microsatellites typically mutate with a frequency of 10⁻³ to 10⁻⁴ although mutation
rates vary between loci by several orders of magnitude (Brinkman et al., 1998; 
Crawford and Cuthbertson 1996). Microsatellite mutation events in the male 
germline outnumber those in the female germline 5- to 6-fold (Brinkman et al., 
1998). Interestingly, alternative alleles at the same locus may differ dramatically 
in terms of their mutation rate (Jin et al., 1996). By contrast, microsatellite loca- 
tion is often conserved at orthologous positions in the genomes of different 
species (Blanquer-Maumont and Crouau-Roy, 1995; Liao and Weiner, 1995; 
Meyer et al., 1995; Moore et al., 1991; Pausova et al., 1995; Stallings et al., 1991; 
Stallings, 1994, 1995; Sun and Kirkpatrick, 1996). Rubinsztein et al. (1995a,b) 
compared allele length distributions for a considerable number of human 
microsatellites with their orthologues in chimpanzees, gorillas, orangutans, 
baboons, and macaques and claimed a tendency for the loci to be longer in 
humans. Although some microsatellites are more variable in nonhuman primates 
than in humans (Kayser et al., 1995), a tendency for the loci to be longer in humans 
would be consistent with the directionality of microsatellite evolution and com- 
patible with evolution proceeding at different rates in different species. However, 
it could also be due to ascertainment bias in that it is precisely the longest, most 
mutable and therefore most informative of human microsatellites that have been
selected as genetic markers (Ellegren et al., 1995). In principle, the large size of the 
human population could have compounded this effect since it could support a 
higher level of genetic diversity; in smaller populations, much variability is lost as 
alleles go to fixation. In practice, however, it would appear as if ascertainment bias 
cannot be the sole explanation for inter-specific differences in microsatellite repeat 
length (Cooper et al., 1998). Thus, the tendency for microsatellite loci to be longer 
in humans may not simply be an experimental artefact. 
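
The ascertainment-bias argument can be illustrated with a toy simulation. The sketch below is not taken from any of the cited studies: the function name, the uniform ancestral repeat numbers and the symmetric changes of up to five repeats per lineage are all illustrative assumptions. Two lineages evolve by an identical process from a shared ancestral locus, yet loci retained because they are long in species A appear, on average, longer in A than in B.

```python
import random

def mean_length_excess(n_loci=5000, min_selected_len=None, seed=1):
    """Mean (species A - species B) repeat number over the loci retained.

    Both lineages change by the same symmetric process (hypothetical
    parameters). If min_selected_len is given, only loci at least that
    long in species A are kept, mimicking marker selection in species A.
    """
    rng = random.Random(seed)
    diffs = []
    for _ in range(n_loci):
        ancestral = rng.randint(5, 25)                   # ancestral repeat number
        len_a = max(1, ancestral + rng.randint(-5, 5))   # change in lineage A
        len_b = max(1, ancestral + rng.randint(-5, 5))   # change in lineage B
        if min_selected_len is None or len_a >= min_selected_len:
            diffs.append(len_a - len_b)
    return sum(diffs) / len(diffs)

unbiased = mean_length_excess()
biased = mean_length_excess(min_selected_len=20)
print(f"all loci: {unbiased:+.2f}; loci selected as long in A: {biased:+.2f}")
```

Under these assumptions the excess in the selected loci is produced by the selection step alone, which is why a real inter-specific difference has to be demonstrated with markers ascertained in both species.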


8.8.3 Telomeric and centromeric repetitive DNA 


Telomeric TTAGGG repeat number varies not only between human chromoso- 
mal arms but also between individuals (Brown et al., 1990; Martens et al., 1998). 
Indeed, there is some evidence in humans for the exchange of telomeric and sub- 
telomeric repeats between nonhomologous chromosomal ends (van Deutekom et 
al., 1996). Since satellite DNA sequences at the telomeric junctions of chim- 
panzees do not show any similarity to their counterparts in human or orangutan, 
it is likely that telomeres became reorganized relatively recently during primate 
evolution (Baird and Royle, 1997; Royle et al., 1994; Royle, 1996). 

Alphoid satellite DNA is found as a tandem repeat in long chromosome-spe- 
cific arrays at the centromeres of all primate chromosomes (Warburton and 
Willard, 1996). These arrays are highly variable within and between homologous 
chromosomes in the same species. 


8.9 Expansion of unstable repeat sequences 


DNA sequences that are internally repetitive are particularly prone to misalignment
during DNA synthesis (Djian, 1998; Di Rienzo et al., 1994; Levinson and
Gutman, 1987; Schlötterer and Tautz, 1992; Valdes et al., 1993). If such misalignment
takes place, the nascent strand can slip back in multiples of the repeat unit
and the newly synthesized DNA strand will be elongated by comparison with the
parental strand. Since many coding sequences contain simple sequence repeats
(Tautz et al., 1986) (for example, the genes encoding the 28S and 18S ribosomal
RNAs (RNR1-5; Hancock, 1995a; Hancock and Dover, 1988) and the TATA-binding
protein (TBP; Hancock, 1993)), repeat expansion may have been
involved in the evolution of these sequences (Hancock, 1995b; see Section 8.9.2).
However, these simple repeats also have the potential to be involved in disease
pathogenesis (see Section 8.9.1) and it is from studies of genetic disease that most
of our knowledge of triplet repeat expansion is derived.
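
The elongation produced by backward slippage can be sketched with a simple string model. This is a deliberately crude illustration, not a model of the polymerase: the function name is hypothetical, strand complementarity is ignored, and a single slippage event of a whole number of repeat units is assumed.

```python
def replicate_with_slippage(template, unit, slipped_units=0):
    """Copy a perfect repeat tract, letting the nascent strand slip back once.

    The template is treated as the sequence to be copied directly
    (complementarity is ignored for simplicity). After the tract has been
    copied, the nascent strand slips back by `slipped_units` whole repeat
    units, so those units are synthesized a second time.
    """
    if template != unit * (len(template) // len(unit)):
        raise ValueError("template must be a perfect tandem repeat of unit")
    nascent = template                    # first pass over the tract
    nascent += unit * slipped_units       # units re-copied after slipping back
    return nascent

parental = "CAG" * 10
daughter = replicate_with_slippage(parental, "CAG", slipped_units=2)
print(len(parental) // 3, "->", len(daughter) // 3)   # repeat number before and after
```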


8.9.1 Triplet repeat expansion disorders 


One manifestation of DNA slippage involving simple repetitive sequences is the 
instability of certain trinucleotide repeat sequences (reviewed by Djian, 1998; 
Monckton and Caskey, 1995; Richards and Sutherland, 1996; Sutherland and 
Richards, 1995; Timchenko and Caskey, 1996; Wells, 1996). This mutational
mechanism underlies fragile X mental retardation syndrome, a condition associated
with the presence of a fragile site on the X chromosome (FRAXA). The
brain-expressed FMR1 (Xq27.3) gene responsible contains an unusual (CGG)n
repeat in its 5’ untranslated region. This repeat exhibits copy number variation of
between 6 and 54 in normal healthy controls, between 52 and >200 in phenotypically
normal transmitting males (the ‘premutation’) and between 300 and >1000
in affected males (the ‘full mutation’) (Verkerk et al., 1991; Fu et al., 1991). Thus a
continuum exists between a copy number polymorphism present in the general 
population, the asymptomatic premutation which involves limited expansion of 
(CGG)n copy number, and the full mutation which appears to require copy 
number expansion beyond a certain threshold value. Expansion of premutations 
to full mutations is thought to be a prezygotic event (Moutou et al., 1997) and 
occurs only during female meiotic transmission. Alleles with a repeat copy num- 
ber of <46 do not exhibit elevated meiotic instability. By contrast, for alleles with 
52-113 repeat copies, the premutation expands to the full mutation in 70% of 
transmissions whereas the corresponding figure for alleles with >90 repeat copies 
is 100%. The probability of repeat expansion thus correlates with the repeat copy 
number in the premutation allele, consistent with a mechanism of slipped mis- 
pairing during replication. Expansion of a sequence can thus itself lead to further 
expansion, a process termed ‘dynamic mutation’ by Richards et al. (1992). In
FRAXA, triplet repeat expansion is thought to exert its pathological effects by
down-regulation of FMR1 gene expression through hypermethylation of the promoter
region upstream of the CGG repeat and repression of translation of the
FMR1 transcript.
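
The continuum of copy number classes described above can be expressed as a small lookup. The sketch below uses only the ranges quoted in this section; the function name is hypothetical, and treating the premutation class as ending just below 300 copies is an assumption made to close the gap left by the quoted figures. Because the normal and premutation ranges overlap, more than one label may apply to a given allele.

```python
def fmr1_categories(cgg_copies):
    """Return every class whose quoted range includes this CGG copy number.

    Ranges follow the figures quoted in the text. The upper bound of the
    premutation class (just below 300) is an assumption.
    """
    labels = []
    if 6 <= cgg_copies <= 54:
        labels.append("normal")
    if 52 <= cgg_copies < 300:
        labels.append("premutation")
    if cgg_copies >= 300:
        labels.append("full mutation")
    return labels

for n in (30, 53, 120, 450):
    print(n, fmr1_categories(n))
```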

The discovery of this novel mutational mechanism has led to the recognition 
that the expansion of unstable triplet repeats is also responsible for a number of 
other human inherited diseases (Table 8.2). These often manifest a wide range of 
clinical severity and possess unusual features such as increasing severity and pen- 
etrance in successive generations (‘anticipation’) and a sex bias in the transmis- 
sion of the disease which correlates with the degree of meiotic instability and 
allelic expansion. As can be seen from Table 8.2, the nature and location of the 
repeat sequence involved varies between disease states as does the extent of the 
expansion necessary to bring about symptoms of disease, and of course the mech- 
anism of pathogenesis consequent to repeat expansion. It should however be 
noted that repeat expansion is not confined to triplet repeats; minisatellite expansion
can also occur as in progressive myoclonus epilepsy and the clinically asymptomatic
fragile sites FRA16A and FRA16B (Table 8.2).

Triplet repeat expansions have so far been found to be associated mainly with 
three types of sequence: CAG (with complement CTG), CGG (with complement 
CCG) and GAA (with complement TTC). Sequence specificity implies a role for 
DNA secondary structure in the expansion mechanism. Such repeats are known 
to form stable hairpin loop structures (Chen et al., 1995; Gacy et al., 1995; Mitas et 
al., 1995) which become more stable with increasing repeat number. DNA poly- 
merase progression appears to be blocked by CTG and CGG repeats and the resul- 
tant idling of the polymerase may serve to catalyze slippage leading to repeat 
expansion (Kang et al., 1995). 
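
The pairing of each triplet with its complement follows from reading the opposite strand in the 5’ to 3’ direction. A minimal sketch (the function name is illustrative) confirms the three pairs listed above:

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq):
    """Sequence of the opposite strand, read 5' to 3'."""
    return seq.translate(COMPLEMENT)[::-1]

for triplet in ("CAG", "CGG", "GAA"):
    print(triplet, "pairs with", reverse_complement(triplet))
```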

A number of diseases are characterized by CAG repeat expansion within the
gene coding region (Table 8.2). This is thought to constitute a gain-of-function
mutation through the incorporation of a polyglutamine tract into the protein product
(Houseman, 1995). Several factors are known to influence the stability of triplet
repeats viz. the type of sequence, length of repeat, whether the repeat is interrupted 
or not, and the orientation of the repeat relative to the origin of replication 
(Andrew et al., 1997). The removal of interrupting point mutations from repeat 
arrays is thought to be very important in promoting triplet repeat expansion 
(Eichler et al., 1994) and is significant in an evolutionary context as well as in cases 
of disease. Removal of these point mutations could occur simply by single base- 
pair substitution or by gene conversion or unequal crossing over. Other inherited 
conditions result from expansion of a triplet repeat in the 3’ untranslated region 
(myotonic dystrophy; DMPK) and an intron (Friedreich ataxia; FRDA) (Table 8.2) 
and the resulting mechanisms of disease are consequently different. 
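
The effect of losing interrupting point mutations can be made concrete by measuring the longest uninterrupted run of a repeat. The allele below is hypothetical (a CGG tract with two AGG interruptions, chosen purely for illustration) and the function name is an assumption; the point is only that two single-base substitutions convert three short pure tracts into one long one.

```python
def longest_pure_run(seq, unit):
    """Longest uninterrupted tandem run of `unit`, in repeat units.

    The sequence is read in frame from its first base.
    """
    best = current = 0
    step = len(unit)
    for i in range(0, len(seq) - step + 1, step):
        if seq[i:i + step] == unit:
            current += 1
            best = max(best, current)
        else:
            current = 0
    return best

# Hypothetical allele: three blocks of nine CGG units separated by AGG.
interrupted = "CGG" * 9 + "AGG" + "CGG" * 9 + "AGG" + "CGG" * 9
# Loss of both interruptions by single A-to-C substitutions.
pure = interrupted.replace("AGG", "CGG")
print(longest_pure_run(interrupted, "CGG"), longest_pure_run(pure, "CGG"))
```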


8.9.2 Nature and distribution of triplet repeats in the human genome 


Trinucleotide repeats are 10- to 100-fold less frequent than (AC)n repeats (Gastier et
al., 1995) and different types of trinucleotide repeat occur at different frequencies. 
Thus repeats (AAT)n and (AAC)n are the most frequent trinucleotide repeats 
found in the human genome (Gastier et al., 1995; Stallings 1994), with (AAT)n 
exhibiting a high degree of copy number polymorphism (Gastier et al., 1995). 
Both (CAG)n and (CCG)n repeats appear to be over-represented in the human 
genome (Han et al., 1994), the former being polymorphic in number in the 
genomes of humans and non-human primates (Sirugo et al., 1997). (CAG)n 
repeats are rare in intronic regions, possibly because of their similarity to the 
acceptor splice site consensus, CAGG (Stallings, 1994). Long homopeptides are 
present in 1.7% of human protein-coding sequences (Karlin and Burge, 1996). 
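
Surveys of this kind rest on scanning sequence for perfect tandem triplets. The following is a minimal sketch of such a scan using a regular-expression backreference; it is not the method of the cited studies, and the function name, the minimum of four copies and the test sequence are all assumptions.

```python
import re

def find_triplet_repeats(seq, min_units=4):
    """Yield (start, unit, copies) for perfect tandem trinucleotide repeats."""
    # One triplet followed by at least (min_units - 1) further copies of itself.
    pattern = re.compile(r"([ACGT]{3})\1{%d,}" % (min_units - 1))
    for match in pattern.finditer(seq):
        unit = match.group(1)
        if len(set(unit)) > 1:          # skip mononucleotide runs such as AAA
            yield match.start(), unit, len(match.group(0)) // 3

# Hypothetical sequence containing a (CAG)6 and an (AAT)5 tract.
seq = "TTGA" + "CAG" * 6 + "TCC" + "AAT" * 5 + "GG"
for start, unit, copies in find_triplet_repeats(seq):
    print(start, unit, copies)
```

Note that the unit reported depends on where the match begins, so (CAG)n, (AGC)n and (GCA)n describe the same tract read in different frames.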

A number of human genes contain polymorphic triplet repeats (e.g. cadherin 2
(CDH2; 18q12), breakpoint cluster region (BCR; 22q11), glutathione-S-transferase
(GSTA1; 6p12), Na+/K+ ATPase β1-subunit (ATP1B1; 1q22-q25)) but these
repeats are not known to exert any pathological effect (Li et al., 1993; Riggins et
al., 1992). Other genes have been identified solely on account of their possession 
of polymorphic trinucleotide repeats and these genes represent candidate loci for 
involvement in complex diseases (Breschel et al., 1997; Néri et al., 1996). 
Trinucleotide repeat containing genes are also found in other species, including 
mouse (Kim et al., 1997), but no nonhuman example of triplet repeat expansion as 
a cause of a genetic disease has yet been documented. 

Replication slippage involving short GC-rich motifs (‘expansion segments’) has
occurred during the evolution of the vertebrate genes encoding the 28S and 18S 
ribosomal RNAs (Hancock, 1995a; Hancock and Dover, 1988). Interestingly, dif- 
ferent segments of the 28S rRNA subunit gene appear to have coevolved by ‘com- 
pensatory slippage’ allowing RNA secondary structure to be conserved as a 
consequence of runs of sequence motifs in one region being compensated for by 
complementary motifs in another (Hancock and Dover, 1990). 


8.9.3 Origin of expanded triplet repeats 


In myotonic dystrophy, all Caucasian and Japanese DM chromosomes possess a 
specific haplotype (Deka et al., 1996; Imbert et al., 1993; Tishkoff et al., 1998;
Yamagata et al., 1996), whilst a second disease-associated haplotype has been
found in Africans (Krahe et al., 1995). Similarly, haplotype analysis has pointed to 
a single origin for Japanese and Caucasian Machado-Joseph disease chromosomes 
(Takiyama et al., 1995), a single origin for Japanese and Caucasian dentatorubral 
and pallidoluysian atrophy chromosomes (Yanagisawa et al., 1996), a single origin 
for the Friedreich ataxia expansion (Cossée et al., 1997), at least two origins for the 
Huntington disease expanded repeat (Squitieri et al., 1994) and a small number of 
FRAXA progenitor chromosomes (Hirst et al., 1994). 

One of the best examples of the evolutionary emergence of an expanded triplet 
repeat is that found in the SRP14 gene (15q22) encoding the 14 kDa Alu RNA- 
binding protein (Chang et al., 1995). The human protein is larger than that of 
mouse and dog on account of an extra 28 residue alanine-rich C-terminal tail 
which is translated from a 3’ GCA-rich trinucleotide repeat. In the prosimian 
Galago, the relevant sequence at the site of the human triplet repeat is GCA GCA, 
whereas the mouse possesses the sequence CCA GCA. By contrast, the African 
green monkey possesses 33 GCA repeats, whilst the owl monkey possesses 52, suggesting
that a CCA→GCA substitution occurred in an ancestral prosimian
thereby creating two consecutive GCA codons which then facilitated GCA expansion
in higher primates. Interestingly, however, no intra-specific variability in
repeat size was detected in primates. 
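
The proposed first step in the SRP14 expansion, a single substitution that creates two adjacent GCA codons, can be written out explicitly. The helper names below are hypothetical; the two codon pairs are those quoted above for mouse and Galago.

```python
def longest_run(codons, unit="GCA"):
    """Longest run of consecutive `unit` codons."""
    best = current = 0
    for codon in codons:
        current = current + 1 if codon == unit else 0
        best = max(best, current)
    return best

def substitute_base(codons, codon_index, base_index, new_base):
    """Return a copy of `codons` with one base replaced."""
    out = list(codons)
    codon = out[codon_index]
    out[codon_index] = codon[:base_index] + new_base + codon[base_index + 1:]
    return out

mouse_like = ["CCA", "GCA"]                        # sequence quoted for mouse
derived = substitute_base(mouse_like, 0, 0, "G")   # C-to-G change in the first codon
print(derived, longest_run(mouse_like), longest_run(derived))
```

The derived pair is identical to the Galago sequence, that is, the shortest tandem array on which slippage could subsequently act.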


8.9.4 Evolution of repeat number in the genes underlying disorders of 
triplet repeat expansion 


The normal range for CAG repeat number in the HD gene is 8-35. Although 
modal CAG repeat length is fairly similar between different human populations, 
the degree of spread varies, being greatest among Africans and lowest among the 
Japanese (Rubinsztein et al., 1994; Watkins et al., 1995). The breadth of this nor- 
mal range suggests that natural selection is acting weakly if at all on HD alleles 
below the disease threshold. However, the population distribution of CAG repeat 
number in the HD gene exhibits an apparent asymmetry in that more alleles lie 
above rather than below the modal length. Using computer simulations, 
Rubinsztein et al. (1994) have shown that the distribution of HD alleles in human 
populations is explicable in terms of a simple length-dependent mutational bias. 
The observed distribution of alleles is thus explicable merely in terms of mutation 
and genetic drift thereby obviating the need to invoke positive selection 
(Rubinsztein et al., 1994). It may be, however, that coding sequences can tolerate 
CAG repeat-encoded polyglutamine tracts relatively well, thereby minimizing the 
effect of negative selection (Green and Wang, 1994). More controversially, in the 
case of myotonic dystrophy and Machado-Joseph disease, meiotic drive (also 
termed segregation distortion), the excess recovery of one of a pair of alleles in the 
gametes of a heterozygous parent, has been proposed as being responsible for
maintaining the frequency in the population of chromosomes bearing triplet 
repeats capable of expansion into the disease range (Chakraborty et al., 1996; 
Leeflang et al., 1996; Takiyama et al., 1997). 
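
The asymmetry referred to above is simply an excess of alleles above the modal length relative to those below it. A minimal sketch of that summary follows; the function name and the sample of allele lengths are hypothetical and are not data from the cited studies.

```python
from collections import Counter

def asymmetry_about_mode(allele_lengths):
    """Return (mode, alleles below mode, alleles above mode)."""
    counts = Counter(allele_lengths)
    # Ties for the mode are resolved in favour of the shorter length.
    mode = max(counts, key=lambda length: (counts[length], -length))
    below = sum(c for length, c in counts.items() if length < mode)
    above = sum(c for length, c in counts.items() if length > mode)
    return mode, below, above

# Hypothetical CAG repeat numbers, all within the normal range of 8-35.
sample = [15, 16, 17, 17, 18, 18, 18, 18, 19, 19, 20, 21, 23, 26, 30]
print(asymmetry_about_mode(sample))
```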

Djian et al. (1996) examined the disease-associated CAG repeats in the HD,
MJD, AR, and SCA1 orthologues of various nonhuman primate species. For the
HD and MJD genes, CAG copy number was polymorphic in the nonhuman primates
but there was essentially no overlap with the normal human range, findings
also reported by Limprasert et al. (1996). The AR gene was also found to be polymorphic
in nonhuman primates but with minimal overlap with the normal
human range (Djian et al., 1996). For the SCA1 gene, the average repeat copy
numbers exhibited by Pan, Gorilla, Hylobates, Macaca, and Cercopithecus were 
within the normal human range but were lower than the modal number in 
humans (Djian et al., 1996). Thus, for these four loci, CAG copy number is lower 
in nonhuman primates than in humans. From what we know of the propensity of 
CAG repeats to expand, this is much more likely to be due to a higher rate of 
expansion in humans than the alternative: contraction in the other primate 
species. In the case of the HD gene, this interpretation is supported by the finding 
of a similar number of CAG repeats in the murine Hd gene to that found in the 
nonhuman primate HD genes (Lin et al., 1994). The high number of CAG repeats 
in the human HD gene is therefore due primarily to expansion within the human 
lineage. This contrasts with the expansion of CAG number in the AR gene which 
began earlier in hominoid evolution; the great apes contain an expanded CAG 
repeat not found in rodents (Choong et al., 1998; Djian et al., 1996). Thus, the AR 
CAG repeat expansion began prior to ape-human divergence but continued in 
human after divergence from the great apes. The most dramatic example of CAG 
expansion is however that of the involucrin (IVL; 1q21) gene which is so rich in 
CAG repeats and codons derived from CAG that it is likely that it is descended 
from a simple poly (CAG) sequence (see Section 8.9.5 below). 

The CGG repeat in the 5'UTR of the FMR1 gene has been studied in 44 mammalian
species from 8 different orders (Eichler et al., 1995). The presence of this
repeat in all species examined indicates that the CGG repeat has been conserved
for over 150 Myrs and is therefore likely to have some functional significance
(reviewed by Eichler and Nelson, 1998). Repeat length was found to be similar
among the 24 nonprimate species, ranging from 4 to 12 units (mean 8.01 ± 0.8).
By contrast, the mean length of the repeat among the 20 primate species examined
was 20.1 ± 2.3. Copy number polymorphism was not found to be limited to
human, with Ornithorhynchus (platypus), Artibeus (phyllostomid bat) and Pan
(chimpanzee) all possessing polymorphic repeats. Parsimony analysis predicted
that the early mammalian CGG repeat was short (4-9 units) and uninterrupted, 
whereas an increase in copy number beyond ~20 repeats appears to have occurred 
at least three times independently in the Catarrhini (Figure 8.13). These expan- 
sions in the hylobatid apes, great apes and the cercopithecoid monkeys were asso- 
ciated with the addition of specific interspersions, CGA, AGG and CGGG 
respectively (Figure 8.13). These interspersions are unlikely to have arisen 
through DNA polymerase slippage and may have been mediated by unequal 
crossing over or gene conversion. 

A comparison of trinucleotide repeats in human versus rodent coding 
sequences has indicated that, by and large, orthologous repeats have not been con- 
served for long periods of evolutionary time, either in terms of their size or loca- 
tion (Stallings, 1994). As yet, there are no known pathological equivalents of 
human dynamic mutations in other species but this may merely reflect bias of 
ascertainment. Thus, nonhuman examples of disease-associated triplet repeat 
expansions may not yet have come to our attention and it may be that these expansions
have no counterpart in the human orthologues. Similarly, the observation
that triplet repeats associated with human disease are sometimes found to be both 
polymorphic and expanded beyond the mammalian ancestral state in non-human 
primates may also be viewed in terms of a bias of ascertainment rather than any 
intrinsic propensity of triplet repeats to expand in primate genomes. Finally, not- 
ing that most of the expanded triplet repeats are found in genes expressed in the 
nervous system, Hancock (1996) proposed that triplet repeat expansion may have 
occurred in parallel with the increase in brain complexity so characteristic of the 
human lineage and may even have been instrumental in bringing about this 
increase in complexity. 


8.9.5 The unique case of involucrin 


Involucrin is a unique protein in that it has evolved very rapidly in the primate 
lineage through changes in the length and copy number of tandem repeats within 
the coding region with the result that a full two thirds of its coding region has 
been created within the anthropoid lineage (Green and Djian, 1992). The evolu- 
tion of this gene/protein is therefore worth examining in some detail. 

Involucrin is the most abundant protein component of the keratinocyte envelope
and is cross-linked to membrane proteins by a transglutaminase. The coding
region of the human involucrin (IVL; 1q21) gene lies within a single exon and
encodes a 585 amino acid glutamine-rich protein (Eckert and Green, 1986). The
human IVL gene is extremely rich in CAG repeats (encoding glutamine) and
codons derived from CAG (GAG and CTG). Thus, CAG codons, and codons one substitution
removed from CAG, comprise 64% of the total codon number. It is possible
therefore that this gene is descended from a simple poly(CAG) sequence
which has been subsequently modified by nucleotide substitution (Tseng, 1997;
Figure 8.14).

Figure 8.14. Proposed scheme for the evolution of the human involucrin (IVL) gene.
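
A statistic of the kind quoted for involucrin, the proportion of codons that are CAG or one substitution removed from it, is straightforward to compute. The sketch below is illustrative only: the function name is hypothetical and the short coding sequence is invented, not part of the IVL gene.

```python
def cag_related_fraction(coding_seq, reference="CAG"):
    """Fraction of codons identical to, or one substitution from, `reference`."""
    usable = len(coding_seq) - len(coding_seq) % 3      # ignore any trailing bases
    codons = [coding_seq[i:i + 3] for i in range(0, usable, 3)]
    related = sum(
        1 for codon in codons
        if sum(a != b for a, b in zip(codon, reference)) <= 1
    )
    return related / len(codons)

# Invented nine-codon sequence: ATG CAG GAG CTG CAG AAA CAG CCG TAA
toy = "ATGCAGGAGCTGCAGAAACAGCCGTAA"
print(round(cag_related_fraction(toy), 3))
```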

The time of origin of the involucrin gene is unclear but the gene is present in a
wide variety of mammals and may be present in all terrestrial vertebrates. The
length of the involucrin molecule has continued to grow by successive addition of
repeats from prosimians (average 409 residues) through New World monkeys
(average 488 residues) and Old World monkeys (average 514 residues) to hominoids
(average 632 residues). The coding region outside the segment of repeats
has by contrast changed very little in length. Since glutamine residues serve as
amine acceptors in the transglutaminase-catalyzed cross-linking, the primordial
glutamine-rich involucrin may have been able to function as a substrate for
transglutaminase.


Segment lengths (codons), in order from the ATG:

                              Non-repetitive   P        Non-repetitive   M         Non-repetitive
Anthropoids                   80-82            6        65-66            232-637   43-46
Tarsioids                     75               242      9                17        44
Non-primates and prosimians   77-83            94-255   64-71            6-10      0-40

Figure 8.15. Coding regions of primate involucrin genes illustrating the relative locations
of repetitive regions P and M. Numbers given denote lengths (in codons) of segments of
repeats and non-repetitive regions (redrawn from Green and Djian, 1992).


Repeat number   Type   Number of altered codons per repeat
10              A      5
9               B      10
8               A      10
7               B      3
6               A      2
5               B      3
4               A      2
3               B      7
2               A      6
1               A      10
Mean                   5.8

Figure 8.16. Consensus amino acid sequences (from 12 anthropoid ape species) for each
of the 10 repeats of the early region of the M segment of involucrin. Alterations in one or
more species are boxed or framed. Solid lines indicate deletions whilst dashed lines
indicate amino acid substitutions. Over half of the amino acids have been deleted or
replaced in one or more species. The doubly underlined Q in repeat 5 is the preferred
residue for cross-linking in human involucrin (redrawn from Green and Djian, 1992).

Figure 8.17. Evolutionary tree of the M region of the primate involucrin gene detailing
repeat additions (after Green and Djian, 1992). The numbers along the branches of the
tree indicate the number of repeats added to form each region. Four late additions within
the middle or early regions and one 3’ of the early region (boxed) are deviations from the
usual vectorial pattern of repeat addition. The deletions assigned to the region, the
lineage in which they occurred, and the number of repeats deleted are given in
parentheses. Ppy: Pongo pygmaeus. Hsa: Homo sapiens. Mmu: Macaca mulatta. Cal: Cebus
albifrons. Soe: Saguinus oedipus. Ppa: Pan paniscus. Hla: Hylobates lar. Cae: Cercopithecus
aethiops. Atr: Aotus trivirgatus. Ggo: Gorilla gorilla. Mfa: Macaca fascicularis. Cha:
Cercopithecus hamlyni.

Involucrin gene sequences have now been determined in a range of primate and
nonprimate species. Two different sites have been found within the coding region
that contain variable numbers of repeat segments. One, at site P, lies ~80 codons
from the ATG, contains a 16-amino acid repeat and is evident in nonprimate
mammals, prosimians and tarsioids (Figure 8.15). The other, at site M, contains
a 10-amino acid repeat, is located a further 68 codons C-terminal to site P and has
increased in size in the anthropoid apes concomitant with a reduction in size of
the P repeat segment (Figure 8.15).

The involucrin genes of the dog, pig and mouse have 6, 13, and 21 repeats, 
respectively at site P whilst the lemur (Lemur catta) and tarsier (Tarsius bancanus) 
possess 19 and 18, respectively. Within each species, however, P repeat segments 
differ; thus three codons have been deleted in eight of the 13 repeat copies in 
Galago crassicaudatus whilst two codons have been deleted in all repeat copies in T. 
bancanus. Consensus sequences also differ between nonanthropoid species in 
terms of nucleotide differences that occurred in any position in a codon. Changes 
in a consensus nucleotide must have arisen from the occurrence of a nucleotide 
substitution in one repeat followed by the correction of the analogous nucleotides 
at the corresponding positions in the other repeats, with adjacent repeats tending 
to be more homogeneous (Phillips et al., 1990). Green and Djian (1992) have pro- 
posed that a form of intragenic gene conversion (see also Chapter 4, section 4.2.2, 
Ribosomal RNA genes and Chapter 9, section 9.5) may have been responsible for
this phenomenon. Although the P segment was retained by prosimians and tar- 
sioids, it was modified by site-specific deletions, the addition of certain repeats 
and nucleotide substitutions which served to alter the consensus codons of the 
repeats by a process of correction operating between neighboring repeats (Djian 
and Green, 1991). It is possible that the retention of the P segment was essential 
for the function of the protein since repeats at site M were lacking. 
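
The way in which a substitution in one repeat, followed by correction of its neighbours, shifts the consensus can be sketched as follows. This is a schematic illustration rather than a model of gene conversion itself: the function names, the six-base repeat and the choice of which repeats are corrected are all assumptions.

```python
from collections import Counter

def consensus(repeats):
    """Most frequent nucleotide at each position across aligned repeats."""
    return "".join(Counter(column).most_common(1)[0][0] for column in zip(*repeats))

def convert(repeats, donor, recipient, pos):
    """Copy the nucleotide at `pos` from the donor repeat into the recipient."""
    out = list(repeats)
    target = out[recipient]
    out[recipient] = target[:pos] + out[donor][pos] + target[pos + 1:]
    return out

ancestral = ["GAGCAG"] * 5                # five identical hypothetical repeats
mutated = list(ancestral)
mutated[2] = "AAGCAG"                     # substitution in the central repeat
corrected = convert(mutated, donor=2, recipient=1, pos=0)
corrected = convert(corrected, donor=2, recipient=3, pos=0)

print(consensus(mutated), consensus(corrected))
```

In this sketch the consensus is unchanged by the substitution alone and shifts only once the adjacent repeats have been corrected, which also leaves neighbouring repeats more alike than distant ones.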

In the anthropoid apes, the M segment increased in size concomitant with the 
dramatic reduction in size of the P repeat segment (Figure 8.15). Perhaps deletion 
of the P segment was only possible after a sufficient number of repeats had 
been generated within the M segment. This notwithstanding, the M segment 
continued to grow by successive addition of repeats from prosimians through 
New World monkeys and Old World monkeys to hominoids. However, the 
repeats have the same consensus sequence in all the anthropoid apes (Figure 8.16). 
The M segment has been divided into early, middle and late regions which have 
been added vectorially in a 3’→5’ direction (Djian and Green, 1989) and the
repeats from these regions are shared to differing extents by different anthropoid 
species (Teumer and Green, 1989). Thus, the early region is common to all anthro- 
poids, the middle region is shared by different species but to a lesser extent, and 
the late region is species-specific and must have developed after the divergence of 
the different species. An evolutionary tree of the M region of involucrin in higher 
primates is shown in Figure 8.17. After its divergence from the cercopithecoids, 
the hominoid lineage added 10 repeats to the middle region. After divergence 
from the common lineage, Hylobates, Pongo, and Homo acquired species-specific 
repeats in the late region. Another shared repeat segment is the early/middle (e/m) 
extension 5’ to the late region which was probably formed independently in Old 
World and New World monkeys. 

The site of repeat addition, which in the common anthropoid lineage was
located in the P segment, moved in a 5’ direction during the evolution of both the
New World and Old World monkeys after their divergence (Green and Djian,
1992). Deletions of repeats also took place within the early, middle and e/m
regions of the M segment although these were relatively few in number as compared
to the additions (Green and Djian, 1992). These deletions have occurred
independently of the vectorial process of repeat addition both spatially and temporally
(Green and Djian, 1992).


Number of B + Bs repeats per allele: 8 and 7 (whites); 7, 8, 9 and 10 (blacks).

Figure 8.18. Polymorphism of the repeat pattern of the late region of the M segment of
human involucrin. Each column of rectangular boxes indicates an allele, and each box
denotes a 10-codon repeat. There are three different kinds of repeat: A (diagonal lines), B
(shaded) and Bs (chequered) which differ with respect to the sequences of the first three
codons (AAGCACCCG, GAGCTCCCA, and GAGCTCTCT respectively). The two
columns on the left denote the variable number of Bs repeats in whites whilst the four
columns on the right denote the variable numbers of B and Bs repeats found in blacks.
Arrowheads indicate the position of the hotspots for repeat addition in the white and
black populations (redrawn from Green and Djian, 1992).

In addition to the inter-specific differences in the size of the involucrin mole- 
cule, the higher primates also exhibit size polymorphisms which result from vari- 
ation in the number of repeats in the late region. In humans, the polymorphism 
comprises variable numbers of B and Bs repeats within the late region (Figure
8.18). The most common allele in white Caucasians contains nine repeats (3Bs,
5B, 1A) but a second ‘Mormon’ allele (2Bs, 5B, 1A) has also been described (Simon
et al., 1989; Urquhart and Gill, 1993). Blacks possess the ‘Mormon’ allele plus
three other alleles containing 6, 7, or 9 B repeats (Simon et al., 1991; Urquhart and
Gill, 1993; Figure 8.18). Repeat number polymorphisms in the late region have
also been reported in Aotus trivirgatus, Macaca mulatta and Gorilla gorilla (Green
and Djian, 1992). The repeat pattern observed in anthropoid apes is explicable in
terms of the presence in the involucrin gene of a hotspot at which repeats are generated
by unequal recombination (Green and Djian, 1992). In the human lineage,
the location of this hotspot varies between racial groups such that in whites it is
located within the Bs repeat region whilst in blacks, it is within the B repeat
region (Figure 8.18). Thus, although the process of vectorial repeat addition has
operated in both blacks and whites, there has been a difference between these 
racial groups in the sites of repeat addition, a finding that is consistent with an 
endogenously controlled mechanism. 
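
Unequal recombination between misaligned repeat arrays yields one longer and one shorter product, and the position of the exchange determines which repeat type changes in number. The sketch below is schematic: the function name is hypothetical, the common allele is taken as three Bs, five B and one A repeat, and the order in which the repeats are listed is an assumption.

```python
def unequal_crossover(allele_a, allele_b, break_a, break_b):
    """Recombine two repeat arrays at (possibly misaligned) repeat boundaries.

    break_a and break_b are the numbers of repeats taken from the start of
    each parental array; unequal values give products of unequal length.
    """
    return (allele_a[:break_a] + allele_b[break_b:],
            allele_b[:break_b] + allele_a[break_a:])

# Assumed repeat order for the common nine-repeat allele.
common = ["Bs"] * 3 + ["B"] * 5 + ["A"]

# Exchange within the Bs region, misaligned by one repeat.
longer, shorter = unequal_crossover(common, common, break_a=2, break_b=1)
print(len(longer), longer.count("Bs"), len(shorter), shorter.count("Bs"))
```

With the exchange placed in the Bs region, the shorter product has the composition of the second allele described above (two Bs, five B and one A repeat); placing the exchange within the B region would instead vary the number of B repeats.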

Attempts have been made to relate the evolution of a larger involucrin molecule
with a different repeat structure to the trend in anthropoid apes towards relative
hairlessness (Green and Djian, 1992). The P segment of repeats could conceivably
have conferred some selective advantage during the early stages of mammalian
and primate evolution as might the M segment when it emerged. At a biochemical
level, natural selection might have favored the retention of repeat-bearing segments
in involucrin so as to provide a substrate for transglutaminase. Once the M
segment had been generated, the repeats at site P could then have been deleted
without selective disadvantage. However, there is a very considerable degree of
latitude between different species in terms of what constitutes an effective
transglutaminase substrate. In a comparison of the involucrins of 19 mammalian
species, only 3.1% of amino acid residues were uniformly conserved (Djian et al.,
1993). Thus, on balance, neither the very high rate of evolutionary change nor
the observed patterns of mutational change in the involucrin gene provide
convincing evidence for natural selection being the major motive force behind
involucrin evolution in higher primates. Indeed, it is unlikely that natural selection
could account on its own for either the highly variable pattern of repeat addition
or the numbers of repeats added.


Table 8.3. Mechanism-based evolution of the repeat segments of involucrin in mammals
(from Green and Djian, 1992)

Mechanism: Shortening or lengthening of pre-existing repeats by site-specific deletion
or insertion.
Organism in which mechanism operates: Site P of prosimians, tarsioids and non-primate
mammals. Not operative in anthropoid apes.

Mechanism: Change in consensus nucleotide at certain positions resulting from mutation
in one repeat followed by correction of neighboring repeats (probably by gene
conversion). Many of the corrections are silent.
Organism in which mechanism operates: Site P of prosimians, tarsioids and non-primate
mammals. Not operative in anthropoid apes.

Mechanism: Addition of shorter repeats as incomplete copies of older ones.
Organism in which mechanism operates: Site P of tarsioids. Not operative in anthropoid
apes.

Mechanism: Generation of a new duplication site (M) by vectorial addition of repeats at
a controlled hotspot moving in a 3’→5’ direction. The site of addition differs in
different human populations and is therefore under genetic control.
Organism in which mechanism operates: Tarsioids and anthropoid apes. Aborted in
tarsioids. Not operative in prosimians and non-primate mammals.

Natural selection and neutral mutation/genetic drift are both processes that 
govern how mutations, once they have arisen, spread through a population with 
the possibility of eventually becoming fixed. By contrast, the evolution of involu- 
crin appears to be primarily mechanism-based, being dominated by endogenously 
controlled and spatially targeted mechanisms of mutagenesis. Such mechanisms 
are likely to be the major factor responsible for the directed evolution of involucrin, 
directed in the sense that the mechanisms promote change in a specific direction 
(i.e. increases in repeat number or gene conversion between homologous repeats; 
see Table 8.3) without the necessary assistance of natural selection. 

Figure 8.19. Hypothetical scheme for the evolution of the involucrin (IVL), loricrin 
(LOR) and small proline-rich protein (SPRR) genes from a single ancestral gene (after 
Backendorf and Hohl, 1992, Volz et al., 1993). The repetitive central domains are 
represented by a triangle (loricrin), a diamond (involucrin) and circles (SPRR proteins). 
The number of repeats is represented by n (n=39 for involucrin, 6 for SPRR1, 3 for 
SPRR2, 14 for SPRR3). The broken arrows indicate an alternative evolutionary route. H 
denotes homogenization between subfamily members. D denotes divergence of different 
classes of genes. 


Interestingly, several other human genes encoding epithelial proteins are 
located on the long arm of chromosome 1 [filaggrin (FLG), loricrin (LOR), 
epithelial mucin (MUC1), and the small proline-rich proteins 1A (SPRR1A), 1B 
(SPRR1B), 2A (SPRR2A), and 3 (SPRR3); Volz et al., 1993]. These genes share a 
similar structure, contain multiple tandem repeats and are believed to be evolu- 
tionarily related both to involucrin and to each other (Backendorf and Hohl, 
1992; Gibbs et al., 1993; Figure 8.19). The LOR (Yoneda et al., 1992), FLG (Gan et 
al., 1990) and MUC1 (Gendler et al., 1990) genes also exhibit repeat copy number 
polymorphism. The apparent association between chromosomal location and 
copy number polymorphism almost certainly reflects the possession by evolu- 
tionarily related proteins of a repeat structure that is particularly prone to the 
process of vectorial addition of repeats. 

Gross gene rearrangements 


Cases are known of the doubling of the entire chromosome outfit, the dou- 
bling of single chromosomes, and of parts of chromosomes; in other cases a 
part of a chromosome appears to be translocated from its habitual site and 
attached to some other chromosome. The grosser forms of mutation may 
indeed play a special evolutionary role in supplying a mechanism of repro- 
ductive incompatibility. 

R.A. Fisher (1930) The Genetical Theory of Natural Selection. 


Gross gene rearrangements involving the inversion, translocation or fusion of 
DNA sequences are relatively rare in inherited disease although much more com- 
mon in cancer. Clearly, such mutations can lead to dramatic changes in gene 
structure, function and/or expression. At first glance, it may therefore seem some- 
what surprising that there are numerous examples of such mutations that have 
occurred during evolution. However, genes do not necessarily always change by a 
slow incremental process of single base-pair substitution and sudden gross 
changes have also played an important role in fashioning our genes. 


9.1 Inversions 


Most examples of inversions that have occurred during mammalian evolution 
have involved large portions of chromosomes and are usually pericentric in that the 
breakpoints are located on opposite chromosomal arms with the inversion span- 
ning the centromere. This contrasts with paracentric inversions in which the 
inversion involves breaks on the same chromosomal arm, with a segment from 
only one chromosome arm being inverted. 


9.1.1 Pericentric inversions 


A good example of a pericentric inversion that has occurred during evolution is 
that of two clusters of zinc finger protein genes on human chromosome 10 
(Tunnacliffe et al., 1993). Cluster A, comprising ZNF11A, ZNF25, ZNF33A, and 
ZNF37A, is located on chromosome 10p11.2 whilst cluster B, which is a partial 
duplicate of cluster A, comprises ZNF11B, ZNF33B, and ZNF37B and is located 
on chromosome 10q11.2. Duplicated gene clusters are usually contiguous but these 
ZNF clusters are located on opposite sides of the centromere consistent with their 
involvement in a pericentric inversion subsequent to the duplication. Tunnacliffe 
et al. (1993) proposed that this event took place during primate evolution. 

The human genes encoding NADP-dependent malate dehydrogenase (ME1; 
6q12), glutathione S-transferase 2 (GSTA2; 6p12) and phosphoglucomutase 3 
(PGM3; 6p12) form part of a chromosome 9 syntenic group in mouse (Kasahara 
et al., 1990). The extension of this syntenic group across the centromere of chro- 
mosome 6 in human is indicative of a pericentric inversion which must have 
occurred since the divergence of rodents and humans. Other examples of inferred 
pericentric inversions that have occurred during primate evolution involve the 
β-casein (CSN2; 4q13-q21; McConkey et al., 1996) gene, the GLI oncogene on 
chromosome 12q13 (Conte et al., 1998), the fibroblast growth factor genes FGF3 and 
FGF4 on chromosome 11q13 (Conte et al., 1998) and the steroid sulfatase 
pseudogene (STSP; Yq11; Yen et al., 1988). Finally, a Y-specific pericentric inver- 
sion occurs as a polymorphism among Gujarati Indians (Spurdle and Jenkins, 
1992). 

Pericentromeric regions appear to be particularly prone to rearrangement 
(Eichler, 1998) and this genetic instability could be related to the presence of spe- 
cific repetitive sequences (Wohr et al., 1996). 


9.1.2 Paracentric inversions 


An example of a human-specific paracentric inversion is that involving the short 
arm of the Y chromosome (Schwartz et al., 1998). A single contiguous segment of 
Xq21 is homologous to two noncontiguous segments of Yp and it is likely that the 
transposition of a ~4 Mb segment from the X to the Y chromosome ~3-4 Myrs 
ago after the divergence of human from chimpanzee was followed by an inversion 
of the sequence. Schwartz et al. (1998) suggested that this inversion could have 
been mediated by recombination between two LINE elements misaligned at 
homologous CATTATTCT motifs. Since humans from different racial groups 
(Caucasian, African, and Asian) have all been found to possess this Yp inversion, 
the rearrangement must have occurred prior to the radiation of the human racial 
groups. 

Another rather different type of paracentric inversion is exemplified by the 
Xq28 inversion polymorphism involving the Emery-Dreifuss muscular dystrophy 
(EMD) and filamin (FLN1) genes. Recombination between two large 11.3 kb 
inverted repeats (with >99% sequence homology) flanking these genes has been 
responsible for a frequent and clinically asymptomatic inversion of the 48 kb 
FLN1/EMD region (Small et al., 1997; see Figure 9.1). The inversion polymorphism 
may have resulted from either inter- or intra-chromatid exchange between 
misaligned repeats. Whichever mechanism is responsible, the inversion is very 
common in the general population, 
occurring in the heterozygous state in 33% of females and 19% of males (18% of 
human X chromosomes surveyed). 
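
As a rough consistency check on these figures, the proportion of heterozygous females expected for an X-linked variant can be estimated from the frequency of inversion-bearing X chromosomes. The sketch below assumes Hardy-Weinberg proportions (random mating, no selection), an assumption that is not made in the source; the function and variable names are illustrative only.

```python
# Hardy-Weinberg sketch for the X-linked Xq28 inversion polymorphism.
# Assumption (not from the source): random mating and no selection.

def expected_female_heterozygosity(q: float) -> float:
    """Expected fraction of heterozygous females for allele frequency q."""
    if not 0.0 <= q <= 1.0:
        raise ValueError("allele frequency must lie between 0 and 1")
    return 2.0 * q * (1.0 - q)

inversion_frequency = 0.18  # inverted X chromosomes, as quoted in the text
observed_female_het = 0.33  # heterozygous females, as quoted in the text

expected = expected_female_heterozygosity(inversion_frequency)
print(f"expected heterozygous females: {expected:.3f}")
print(f"observed heterozygous females: {observed_female_het:.3f}")
```

Under these assumptions the expected proportion is about 30%, which is of the same order as the 33% reported for females.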

Other regionally localized paracentric inversions may be inferred from the 
presence of long inverted repeat regions containing homologous but divergently 
transcribed genes. Examples of this include the α-amylase (AMY1A and AMY1B; 

Figure 9.1. Structure of the Xq28 region containing the human emerin (EMD) and 
filamin (FLN1) gene loci (from Small et al., 1997). The exons of the EMD and FLN1 
genes are indicated by black boxes, some of which are numbered for orientation. The 
direction of transcription for each gene is indicated by small arrows below. The thick 
black arrows represent the 11.3 kb inverted repeats. The black circle represents the 
centromere and the position of the telomere is indicated. 


1p21; Groot et al., 1990) and the α- and β-fibrinogen (FGB and FGG; 4q31; Kant 
et al., 1985) genes. Genes which are divergently transcribed, evolutionarily related 
and which share bidirectional promoters (see Chapter 5, section 5.1.5) are likely 
to have arisen by a process of duplication and inversion. Other genes have clearly 
evolved by this same process but are almost certainly too far apart to share 
promoter elements, for example the serum amyloid A1 and A2 genes (SAA1, SAA2; 
15-20 kb apart on chromosome 11p15) and the GABA receptor β3 and α5 genes 
(GABRB3, GABRA5; 100 kb apart on chromosome 15q11-q12). 


9.1.3 Physiological and pathological inversions 


Somatic inversions, both relatively small and rather larger involving megabase- 
sized DNA fragments, occur physiologically during rearrangement of DNA at the 
immunoglobulin κ (IGKV; 2p12) and T cell receptor β (TCRB; 7q35) loci in 
human (Weichhold et al., 1990). In human pathology, sporadic chromosomal 
inversions are not uncommon although each type of inversion is likely to be indi- 
vidually rare. By contrast, intragenic DNA sequence inversions, occurring as a 
result of recombination between inverted repeats in the germline, are highly 
unusual. The best known example is that found in the factor VIII (F8C) gene 
causing hemophilia A: this rearrangement occurs in about 40% of severely 
affected patients and recurs at high frequency (Lakich et al., 1993; Naylor et al., 
1993). The mechanism responsible is thought to be homologous intrachromoso- 
mal recombination between a gene (F8A) located in intron 22 of the F8C gene and 
one of two additional homologues of the F8A gene situated 500 kb upstream of the 
F8C gene. 
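
The effect of recombination between a pair of inverted repeats can be illustrated on a toy sequence. The sketch below is only a schematic model: the sequence and the short `repeat` are hypothetical and do not represent the F8A copies or any real locus, and the function name is invented for illustration. It shows that the segment lying between a repeat and its inverted copy is replaced by its reverse complement, while the flanking repeats themselves are left unchanged.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]

def invert_between_repeats(seq: str, repeat: str) -> str:
    """Invert the segment lying between `repeat` and its inverted copy."""
    left = seq.find(repeat)
    if left < 0:
        raise ValueError("repeat not found")
    right = seq.find(reverse_complement(repeat), left + len(repeat))
    if right < 0:
        raise ValueError("no inverted copy of the repeat found")
    start = left + len(repeat)
    return seq[:start] + reverse_complement(seq[start:right]) + seq[right:]

# Hypothetical toy chromosome: repeat, unique segment, inverted repeat copy.
repeat = "GATTC"
toy = "AA" + repeat + "CCGTA" + reverse_complement(repeat) + "TT"
inverted = invert_between_repeats(toy, repeat)
print(toy)
print(inverted)
```

Because the repeats survive the exchange intact, the same event can recur and restore the original orientation, which is consistent with such inversions behaving as recurrent rearrangements or polymorphisms.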


9.1.4 Intragenic inversions 


Not surprisingly, intragenic inversions occurring during gene evolution are 
extremely uncommon. One example is however provided by the family of 
inter-α-trypsin inhibitors encoded by four genes in the human genome (ITIH1, ITIH3, 
ITIH4, 3p21; ITIH2, 10p14-p15). The ancestral ITIH gene was first duplicated 
with one copy being translocated to chromosome 10 about 300 Myrs ago (Diarra- 
Mehrpour et al., 1998). Prior to further gene duplication and divergence, the pri- 
mordial ITIH1/ITIH3/ITIH4 gene experienced an inversion of exons 3-13. 


9.1.5 Common sites for inversions in pathology and evolution? 


It has been known for some time that chromosomal inversions in human pathol- 
ogy are nonrandomly distributed although it remains unclear whether this is due 
to interchromosomal differences or to bias of ascertainment (Dutrillaux et al., 
1986; Madan, 1995). Some of these inversions have nevertheless been character- 
ized with sufficient precision to allow comparison with inversions that have 
occurred in the karyotypic evolution of the hominids (Dutrillaux, 1988; Yunis 
and Prakash, 1982). Miro et al. (1992) performed such a comparison and demon- 
strated that 10/20 pericentric inversions and 1/4 paracentric inversions which had 
occurred during human chromosome evolution coincided (albeit at low level res- 
olution) with known sites of pathological inversion. These (evolutionary) inver- 
sions were inv(1) (p12q21.22), inv(2) (p11.2q13), inv(4) (p14q21.1), inv(5) 
(p13.3q13.3), inv(7) (q11.23q22.2), inv(8) (p21.1q22.1), inv(9) (p24.2q12), inv(11) 
(p15.5q13), inv(16) (p11.2q12.1), inv(18) (p11.32q11.2), and inv(Y) (p11.2q11.23). 
If the sites that have been involved in chromosome inversions during primate 
evolution were also to be involved in cases of human chromosome pathology, this 
would argue for certain chromosomal regions possessing sequence characteristics 
(possibly long highly conserved inverted repeats) that could predispose to this 
type of lesion. Thus higher resolution studies such as that on human chromosome 
12q15 (which contains breakpoints associated with both benign solid tumors and 
a pericentric inversion that occurred during hominoid evolution; Nickerson and 
Nelson, 1997) could yield valuable insights into the mechanisms underlying chro- 
mosome rearrangement in both pathology and evolution. 
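
The distinction between pericentric and paracentric inversions can be read directly from the breakpoint notation: breakpoints on different arms (p and q) span the centromere, whereas breakpoints on the same arm do not. The sketch below applies this rule to the evolutionary inversions listed above. The regular expression is a simplified assumption that handles only the `inv(N) (armBAND armBAND)` form used in this section, not the full cytogenetic nomenclature.

```python
import re
from collections import Counter

# Simplified pattern: chromosome, then two breakpoints each given as arm + band.
INVERSION = re.compile(r"inv\((\w+)\)\s*\(([pq])[\d.]+([pq])[\d.]+\)")

def classify_inversion(notation: str) -> str:
    """Classify an inversion as pericentric or paracentric from its breakpoints."""
    match = INVERSION.fullmatch(notation.strip())
    if match is None:
        raise ValueError(f"unrecognized inversion notation: {notation!r}")
    _, arm1, arm2 = match.groups()
    return "pericentric" if arm1 != arm2 else "paracentric"

# Inversions listed in the text (Miro et al., 1992).
EVOLUTIONARY_INVERSIONS = [
    "inv(1) (p12q21.22)", "inv(2) (p11.2q13)", "inv(4) (p14q21.1)",
    "inv(5) (p13.3q13.3)", "inv(7) (q11.23q22.2)", "inv(8) (p21.1q22.1)",
    "inv(9) (p24.2q12)", "inv(11) (p15.5q13)", "inv(16) (p11.2q12.1)",
    "inv(18) (p11.32q11.2)", "inv(Y) (p11.2q11.23)",
]

counts = Counter(classify_inversion(n) for n in EVOLUTIONARY_INVERSIONS)
print(counts)
```

Applied to this list, the rule yields ten pericentric inversions and one paracentric inversion, inv(7), in agreement with the 10/20 and 1/4 quoted from Miro et al. (1992).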


9.2 Translocations and transpositions 


The human genome is replete with examples of the transposition and translocation 
of gene sequences during evolution. A selection of some of the most notable 
examples is given here while others have been discussed in the context of gene 
duplication (see Chapter 8, section 8.5). The occurrence of most translocations 
has been inferred from the localization of evolutionarily related genes on different 
chromosomes or different chromosomal arms. One example is provided by the 
human aminoacyl-tRNA synthetase gene family (Brenner and Corrochano, 1996) 
whose evolution is depicted in Figure 9.2. Another is the human γ-aminobutyric 
acid receptor (GABA(A)R) which comprises several different types of subunit that 
combine to form a pentameric channel complex and which are encoded by a small 
but dispersed gene family. The GABRA1, GABRB2 and GABRG2 genes have 
been localized to chromosome 5q34-q35 (Russek and Farb, 1994) whilst similar 
clusters comprising GABRA2, GABRB1, and GABRG1 (4p13-p12) and 
GABRA5, GABRB3 and GABRG3 (15q11.2-q12) are present on two other 
chromosomes. This organization is compatible with the duplication of an ancestral 
gene cluster and its translocation to other chromosomes. 
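
The inference drawn here rests on a simple observation: each chromosomal cluster contains one gene of each subunit class. A minimal sketch of that check is given below, using only the gene symbols and locations quoted above. Deriving the subunit class from the fifth letter of the gene symbol is a convenience adopted for this illustration, and the helper names are hypothetical.

```python
from collections import defaultdict

# Gene symbol -> cytogenetic location, as given in the text.
GABA_A_GENES = {
    "GABRA1": "5q34-q35", "GABRB2": "5q34-q35", "GABRG2": "5q34-q35",
    "GABRA2": "4p13-p12", "GABRB1": "4p13-p12", "GABRG1": "4p13-p12",
    "GABRA5": "15q11.2-q12", "GABRB3": "15q11.2-q12", "GABRG3": "15q11.2-q12",
}

def subunit_class(symbol: str) -> str:
    """GABRA*, GABRB*, GABRG* -> 'A', 'B', 'G' (alpha, beta, gamma)."""
    return symbol[4]

def clusters_by_location(genes: dict) -> dict:
    """Group subunit classes by the chromosomal region that carries them."""
    clusters = defaultdict(set)
    for symbol, location in genes.items():
        clusters[location].add(subunit_class(symbol))
    return dict(clusters)

clusters = clusters_by_location(GABA_A_GENES)
for location, classes in sorted(clusters.items()):
    print(location, sorted(classes))
```

Every cluster carries the same set of three subunit classes, which is the pattern expected if an ancestral alpha/beta/gamma cluster was duplicated as a unit and the copies were then dispersed to different chromosomes.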

Tryptophan hydroxylase, tyrosine hydroxylase and phenylalanine hydroxylase 
are members of the family of pterin-dependent aromatic amino acid hydroxylases 
and are encoded by the TPH (11p15.3-p14), TH (11p15.5) and PAH (12q22-q24.1) 
genes, respectively. PAH and TPH are estimated to have arisen by a process of 



Figure 9.2. Hypothetical translocation events in the evolution of the aminoacyl-tRNA 
synthetase genes (after Brenner and Corrochano, 1996). The two exons encoding the 32- 
amino acid repeat and the rest of the gene are shown by stippled and hatched boxes 
respectively. An ancestral class 2 aminoacyl-tRNA synthetase gene containing the two 
repeat-containing exons gave rise to the prolyl-tRNA synthetase and histidyl-tRNA 
synthetase (HARS) genes. The prolyl-tRNA synthetase gene of extant animals contains 
several copies of the repeat and is fused to the gene encoding glutamyl-tRNA synthetase 
(EPRS). The HARS gene underwent an inverted duplication and a translocation in the 
human lineage resulting in two genes in opposite orientation. The translocation may 
have occurred through the first intron, resulting in the capture of a new exon in the 
histidyl-tRNA synthetase-homologous (HO3) gene. In the Fugu lineage, a translocation 
through the first intron also allowed the capture of a new exon. The human tryptophanyl- 
tRNA synthetase (WARS) gene, encoding a class 1 enzyme, has captured the two exon 
repeat by a translocation into a class 2 gene. The loss of a further intron resulted in the 
structure of the extant WARS gene in humans. 


duplication and divergence some 750 Myrs ago whilst PAH and TH diverged sub- 
sequently about 600 Myrs ago (Craig et al., 1986; Ledley et al., 1987; Figure 9.3). 
The synteny of the extant TPH and TH genes implies that the PAH gene must 
have been translocated to chromosome 12 only after the second duplicational 
event 600 Myrs ago (Figure 9.3). A similar explanation may pertain for the insulin 
(INS; 11p15.5), insulin-like growth factor 1 (IGF1; 12q22-q24.1) and insulin-like 
growth factor 2 (IGF2; 11p15.5) genes (Bell et al., 1985; Tricoli et al., 1984). 

In humans and orangutans, two tRNA(Asn) gene clusters are located on the short 
and long arms of chromosome 1, respectively (TRN, 1p36.1; TRNL, 1q21) 
(Buckland et al., 1992). By contrast, Old World monkeys possess only one tRNA(Asn) 
gene cluster on chromosome 1p whilst the capuchin (a New World monkey) 

Figure 9.3. Genetic events in the evolution of the tryptophan hydroxylase (TPH), 
tyrosine hydroxylase (TH) and phenylalanine hydroxylase (PAH) genes from an ancestral 
hydroxylase gene (hyd). 1. Duplication giving rise to TPH gene. 2. Duplication giving 
rise to TH and PAH genes. 3. Translocation resulting in TPH and TH remaining on the 
same chromosome whilst PAH is separated from them. 


possesses a single cluster on 1q. These data are consistent with the interpretation 
that the tRNA(Asn) gene cluster split before the divergence of Old World monkeys 
and hominoids about 30 Myrs ago. No simple inversion can account for this 
situation and, since this chromosome is well conserved in primate evolution, 
Buckland et al. (1992) speculated that one tRNA(Asn) gene cluster could have been 
relocated from one chromosome arm to the other by some form of replicative 
transposition. 

Another example of a gene translocation occurring during the evolution of the 
primates is provided by the adenine nucleotide translocator 3 (ANT3) and steroid 
sulfatase (STS) genes. The ANT3 and STS genes are pseudoautosomal (Xp22.32) 
in humans and other higher primates but both localize to an autosome in lemurs 
suggesting that there was an autosome-to-X/Y translocation after the simians 
diverged from the prosimians (Toder et al., 1995). 


9.2.1 Pericentromeric-directed transposition 


Several examples of pericentromeric-directed transposition have been described 
in primate genomes and since these have been well characterized, they are worthy 
of discussion in some detail here. Comparative FISH analysis of chromosomes 
from various primates has shown that a 27 kb region of Xq28 containing the cre- 
atine transporter (SLC6A8) gene and five exons of the ‘CDM’ gene (DXS1357E) 
has been duplicated in its entirety and translocated to 16p11.1, probably within 
the last 10 Myrs (Eichler et al., 1996; Figure 9.4). An inverted cluster of Alu 
repeats and a number of immunoglobulin-like CAGGG repeat motifs lie in the 
vicinity of the Xq28 breakpoint. Either type of repeat could have mediated the 
transposition event but the CAGGG repeats may be the more likely candidate for 
involvement since they have been found to flank a number of other recently 
duplicated gene sequences (Eichler et al., 1998a). The chromosome 16 SLC6A8 
paralogue (SLC6A10) retained both its putative promoter and transcriptional 
activity (Iyer et al., 1996) but the chromosome 16 CDM paralogue clearly 
represents a truncated pseudogene. A very similar Xq28-to-16p11 transposition, reported 
by Eichler et al. (1997), involves the independent transposition of a 9.7 kb 
segment encompassing exons 7-10 of the adrenoleukodystrophy (ALD; Xq28) gene 
to the pericentromeric regions of chromosomes 2p11, 10p11, 16p11, and 22q11. 
This ALD paralogy domain lies only 27 kb telomeric to the SLC6A8/CDM paral- 
ogy domain (Figure 9.4). The autosomal sequences represent truncated non- 
processed pseudogenes with sequence divergence being consistent with the 
duplication/transposition events having occurred between 5 Myrs and 10 Myrs 
ago. As far as the ALD transposition events are concerned, an initial event 
directed to chromosome 2p11 appears to have served as a ‘seed’ for further peri- 
centromeric-directed transposition events to occur (Figure 9.4). An inverted clus- 
ter of Alu repeats lies in the vicinity of the Xq28 breakpoint. In addition, Eichler 
et al. (1997) noted the presence of a GCTTTTTGC repeat flanking the duplicated 
region which they speculated might serve as a sequence-specific integration site 
for transposition. Such a sequence may provide a hyper-recombinogenic signal 
which could account for the propensity of this locus to be involved in transposi- 
tional events. 
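
Candidate integration motifs of this kind are usually located by scanning both strands of a sequence. The following sketch shows one simple way to do this for the GCTTTTTGC motif mentioned above. The test sequence is a hypothetical toy construct, not real Xq28 sequence, and the function names are illustrative.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]

def find_motif(seq: str, motif: str) -> list:
    """Return sorted (position, strand) pairs for a motif on either strand."""
    hits = []
    for strand, pattern in (("+", motif), ("-", reverse_complement(motif))):
        pos = seq.find(pattern)
        while pos != -1:
            hits.append((pos, strand))
            pos = seq.find(pattern, pos + 1)  # step by 1 to allow overlaps
    return sorted(hits)

MOTIF = "GCTTTTTGC"  # repeat noted by Eichler et al. (1997)

# Hypothetical toy sequence with one copy of the motif on each strand.
toy = "AT" + MOTIF + "ACGTACGT" + reverse_complement(MOTIF) + "TA"
hits = find_motif(toy, MOTIF)
print(hits)
```

Positions are zero-based and refer to the forward strand; a hit on the "-" strand marks where the reverse complement of the motif begins.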

Other pericentromeric-directed duplications/transpositions in the human 
genome (reviewed by Eichler 1998) involve the immunoglobulin Vκ light chain 

Figure 9.4. Pericentromeric-directed transposition events from the ALD/SLC6A8 locus 
at Xq28 to other chromosomes. An initial duplication/transposition targeted 
chromosome 2p11 from where subsequent transposition events were directed to 16p11, 
10p11, and 22q11. The paralogy domains encompassing, (i) the SLC6A8 and CDM genes, 
and (ii) the ALD gene are boxed. CEN, centromere; TEL, telomere. 

locus (Borden et al., 1990; Zimmer et al., 1990), the immunoglobulin heavy chain 
VH region (Tomlinson et al., 1994), the GABA A receptor α5 (GABRA5) gene 
(Ritchie et al., 1998) and the neurofibromatosis type 1 (NF1) gene (Regnier et al., 
1997). The inter-chromosomal duplication and transposition of NF1-related 
sequences may be related to the presence in the vicinity of α-satellite sequences 
that could have served to promote their dispersal. The transposition associated 
with the immunoglobulin Vκ light chain locus (IGKV; chromosome 
2p11.1 to chromosome 1; Arnold et al., 1995) was directed to a site containing an 
ALD-like GCTTTTTGC repeat suggesting that a similar mechanism may have 
been responsible for these transpositional events. 

The pericentromeric zinc finger gene cluster on chromosome 19p12 which 
harbors the ZNF208 gene is flanked by large blocks of β-satellite repeat sequences 
(Eichler et al., 1998b). This gene cluster is thought to have arisen early in primate 
evolution (~50 Myrs ago) by a process of pericentromeric-directed transposition. 
Eichler et al. (1998b) proposed a model in which an ancestral ZNF gene became 
associated with β-satellite repeat sequences at 19p12. Such repeats are capable of 
rapid expansion, possibly by unequal crossing over, and may have served to pro- 
mote the rapid amplification of the associated ZNF gene. 

The above examples serve to indicate that the pericentromeric regions of 
human chromosomes have frequently acquired sequences from remote genomic 
locations. Indeed, these regions appear to be very dynamic, being subject to 
amplification, duplication, deletion and inversion events as well as translocations 
(Eichler, 1999; Jackson et al., 1999). Many of these pericentromeric rearrange- 
ments have occurred relatively recently during primate evolution leading to 
marked inter-specific differences even among the great apes. The potential evolu- 
tionary importance of pericentromeric regions can perhaps be gauged from 
Eichler’s (1999) description of them as ‘recruitment stations for repeats’ and 
‘reservoirs for the accumulation of transposed genic segments.’ 


9.2.2 Sub-telomeric transposition 


Sub-telomeric regions may be similarly dynamic. Thus the rapid proliferation of 
multigene families clustered near telomeres may have occurred by repeat-medi- 
ated bursts of duplication/transposition of short stretches of genomic DNA (e.g. 
the olfactory receptor genes; Trask et al., 1998). Duplicational transposition also 
resulted in the telomeric localization of a number of pseudogenes derived from 
the Chl1-related helicase (DDX11, 12p11; DDX12, 12p13; Amann et al., 1996) 
and interleukin 9 receptor (IL9R; Xq28/Yq12; Kermouni et al., 1995) genes. The 
‘spreading’ of a sub-telomeric region has also been described by Monfouilloux et 
al. (1998); originally localized to 17qter in chimpanzee and orangutan, a specific 
sub-telomeric domain has been translocated in humans and has colonized several 
other chromosome ends. Finally, there is emerging evidence for sequence 
exchange between human telomeric and centromeric regions (Eichler et al., 1997; 
Jackson et al., 1999; Vocero-Akbani et al., 1996). This may have had consequences 
for the growth and distribution of multigene families as in the case of the human 
olfactory receptor genes which, although clustered predominantly in subtelom- 
eric regions, are also located in pericentromeric regions (Rouquier et al., 1998). 

Some sequence exchanges that have been reported as translocations may, 
however, upon closer scrutiny be explicable by other mechanisms. The 
‘inter-chromosomal polymorphism’ involving a sub-telomeric exchange between regions on 
chromosomes 4q35 and 10q26, which has been shown to be present in at least 20% 
of the human population, is a case in point (van Deutekom et al., 1996). These 
regions are thought to be 100-400 kb in length and contain 3.3 kb tandem repeat 
units (each containing two homeodomain sequences and two different classes of 
GC-rich repetitive DNA) that are 98% homologous to each other (Cacurri et al., 
1998). An explanation invoking frequent inter-chromosomal translocation would 
be virtually unprecedented. The most parsimonious explanation would appear to 
be inter-chromosomal gene conversion followed by nonhomologous recombina- 
tion leading to repeat homogenization. 


9.2.3 Translocations and chromosome associations 


Do some chromosomal rearrangements occur as a consequence of the spatial asso- 
ciation of particular chromosomes on the metaphase plate? Nagele et al. (1995) 
have suggested that human chromosomes are arranged in antiparallel fashion as a 
rosette and that this chromosome arrangement is both consistent and cell-type 
independent in human cells. The high frequency of translocations involving cer- 
tain chromosomes, whether in pathology (somatic/germline; Rabbitts, 1994; 
Mitelman et al., 1997) or evolution, may therefore be due at least in part to certain 
chromosomes being positioned in close proximity to each other during the cell 
cycle. 


9.2.4 Translocations in pathology and evolution 


As with inversions (Section 9.1), the chromosomal sites that have been involved 
in translocations during mammalian evolution sometimes crop up in cases of 
human chromosome pathology. Thus the breakpoint of X;1 papillary renal cell 
carcinoma-associated translocations has been mapped to a <200 kb region 
between the SPTA1 and CD1C genes (Weterman et al., 1996) at the same 1q21 
location that contains the boundary between human and mouse syntenic regions 
(Oakey et al., 1992). Another example of such a correspondence is provided by the 
holoprosencephaly critical region at 2p21, a site which is associated with a 
translocation breakpoint in a gibbon karyotype (Arnold et al., 1996). If such cor- 
respondences are not simply coincidental, these chromosomal regions could pos- 
sess characteristics that would predispose to translocations. Their study could 
therefore yield valuable insights into the mechanisms underlying chromosome 
rearrangement in both pathology and evolution. 


9.3 Gene fusion 


Gene fusions resulting from somatic chromosomal translocations are a common 
cause of tumorigenesis (Barr, 1998). By contrast, gene fusions occurring in the 
germline and causing genetic disease (e.g. hemoglobin Lepore) are a highly 
unusual cause of human genome pathology (reviewed by Cooper and Krawczak 
1993). Such fusions are however thought to have had an important role in evolu- 
tion through the creation of novel genes encoding novel combinations of func- 
tional protein domains; some examples of this phenomenon are described below. 
As we have seen in Chapter 3, section 3.6, exon shuffling has been an important 
process in evolution, serving to bring together coding sequence blocks from dif- 
ferent sources to generate new proteins capable of novel interactions and there- 
fore, potentially, with novel functions. In a sense, therefore, all proteins have 
evolved by a series of gene fusion events. In this section, however, we concentrate 
solely on the generation of novel genes through the fusion of once distinct and 
independently functional gene sequences by recombination. 


9.3.1 Gene fusion during evolution 


Carboxyesterase E1, an enzyme responsible for the detoxification of ingested 
xenobiotics, exhibits sequence homology both with acetylcholinesterase and the 
carboxy terminal end of thyroglobulin, a precursor of thyroid hormone (Takagi et 
al., 1991). Phylogenetic analysis has suggested that both acetylcholinesterase 
(human ACHE gene on 7q22) and thyroglobulin (human TG gene on 8q24.2- 
q24.3) evolved from a common ancestral gene that encoded a carboxyesterase and 
that the emergence of thyroglobulin must have preceded the divergence of the 
vertebrates and invertebrates (Takagi et al., 1991). The evolutionary origin of the 
amino terminal portion of extant thyroglobulin is however unknown. 

The human multidrug resistance (PGY1; 7q21.1) gene encodes a 
membrane-associated pump protein called P-glycoprotein. Sequence homology between the 
amino and carboxy terminal halves of the protein was initially held to be consis- 
tent with the view that the protein evolved by duplication of a primordial gene. 
However, once the structure of the 29 exon PGY1 gene had been determined, it 
became clear that only two intron pairs (both within nucleotide binding domains) 
were located in conserved positions in the two halves of the protein. Thus, rather 
than a primordial duplication, Chen et al. (1990) proposed that primordial pro- 
teins corresponding to the left and right halves of P-glycoprotein were formed 
independently by the fusion of closely related genes encoding the nucleotide- 
binding domain with genes for different transmembrane domains. Subsequent 
fusion of these two independently derived genes then resulted in the formation of 
the PGY1 gene. Pauly et al. (1995) have speculated that a highly conserved 
poly(CA).poly(TG) sequence in intron 15 of the PGY1 gene could have, with Alu 
sequences in introns 14 and 17, mediated such a fusion event by recombination. 

The human glutamyl-prolyl-tRNA synthetase (EPRS) gene serves to encode two 
distinct aminoacyl-tRNA synthetase activities joined as part of a multi-enzyme 
synthetase complex. The EPRS gene, located at chromosome 1q41-q42, 
comprises 29 exons spanning some 90 kb (Kaiser et al., 1994). Exons 4 to 10 encode a 
glutamyl-tRNA synthetase (Kaiser et al., 1992) whilst exons 19 to 29 encode a 
prolyl-tRNA synthetase (Kaiser et al., 1994). The function of the intervening 
region (exons 11 to 18) is unclear (actin binding?; Kaiser et al., 1994). Since the 
glutamyl- and prolyl-tRNA synthetases belong to different enzyme classes 
which are believed to have evolved by separate pathways (Nagel and Doolittle, 
1991), the composite EPRS gene would appear to represent the consequence of an 
ancient gene fusion. 

The human quiescin Q6 (QSCN6; 1q24) gene also appears to represent the 
product of an ancient gene fusion that occurred during metazoan evolution (Coppock et 
al., 1998). The N-terminal portion of the quiescin Q6 protein is related to the 
thioredoxin family (human members include thioredoxin (TXN; 9q31) and 
prolyl-4-hydroxylase β polypeptide (P4HB; 17q25)) whilst the C-terminal portion is related to the 
yeast Erv1 growth regulatory gene (human homologue, GFER, 16p13). 

The human lamin B receptor is an integral protein of the inner nuclear mem- 
brane encoded by a gene (LBR) on chromosome 1q42.1. The LBR gene comprises 
13 protein coding exons, the first four of which encode the N-terminal domain 
and the remainder the C-terminal sterol reductase-like domain (Holmer et al., 
1998). A relatively large intron separating exons 4 and 5 is suggestive of an ancient 
recombination between two genes, one encoding a basic nuclear protein, the other 
a sterol reductase. Consistent with this interpretation, the 3’ end of the LBR gene 
is structurally homologous with two other human genes viz. the transmembrane 
protein TM7SF2 (TM7SF2; 11q13) and 7-dehydrocholesterol reductase 
(DHCR7; 11q13) (Holmer et al., 1998). 

A highly unusual example of gene fusion is provided by the human HLA-DRB6 
(6p21.3) gene. This gene had represented something of a paradox in that 
although both exon 1 and the promoter region are absent in the orthologous 
human, chimpanzee and rhesus macaque genes, these genes are still capable of 
being transcribed. This apparent paradox has now been resolved by the elucida- 
tion of the molecular basis of the change: the insertion of a retroviral (mouse 
mammary tumor virus) LTR into intron 1 of the primate HLA-DRB6 gene >23 
Myrs ago, prior to the divergence of the Old World monkeys from the human-ape 
lineage (Mayer et al., 1993; see Chapter 5, section 5.1.12, Endogenous retroviral ele- 
ments). Whether the exon/promoter deletion accompanied or instead followed the 
LTR insertion is unclear. What is clear is that an open reading frame for a new 
exon was created by the insertion which serendipitously encoded a hydrophobic 
sequence that was able to function as a leader for the truncated HLA-DRB6 pro- 
tein. The new exon provided a functional donor splice site at its 3’ end which 
potentiated in-register splicing with exon 2 of the HLA-DRB6 gene. The LTR 
also provided a substitute promoter region essential for the transcription of the 
downstream gene. Thus, this case not only provides an example of the de novo cre- 
ation of an exon but also illustrates a potential evolutionary mechanism to retar- 
get proteins within the cell. The incorporation of a novel leader peptide into, for 
example, a cytoplasmic protein that previously lacked one could result either in 
an alteration in the cellular localization of the protein or its conversion into a 
secreted protein. 

The mammalian defensins constitute a family of microbicidal and cytotoxic 
peptides made by neutrophils. In humans, they are encoded by a gene cluster 
(DEFA1, DEFA3, DEFA4, DEFA5, DEFA6, DEFB1) at chromosome 8p23 (Bevins et al., 
1996; Liu et al., 1997). After a primordial defensin gene had duplicated to 
yield two genes ancestral to extant DEFA5 and DEFA6, an unequal crossing over 
event generated a novel hybrid defensin gene which was the ancestor of the 
present-day hematopoietic defensin genes DEFA1, DEFA3, DEFA4 (Figure 9.5). 
Figure 9.5. Model for the involvement of an homologous unequal crossover in the 
evolution of the human defensin gene family (after Bevins et al., 1996). 

Shaded enclosed boxes denote exons, with exons I and II representing the two exons of 
the epithelial defensin genes DEFA5 and DEFA6. Exon I’ is the characteristic upstream 
exon of hematopoietic defensins whilst exons II’ and III’ are homologous to exons I and 
II of the epithelial defensins. Solid boxes denote regions of striking conservation between 
all defensins. Small arrows denote the locations of the transcription start sites. 


The SCYA18 (17q11.2) gene, which encodes a member of the small inducible 
cytokine family, appears to have been generated by the fusion of two macrophage 
inflammatory protein-1α-like (SCYA3) genes with subsequent deletion and 
selective use of exons (Tasaki et al., 1999). Since there are several related 
genes (SCYA3, SCYA3L1 and SCYA3L2) in the vicinity of SCYA18, the authors 
suggested that 
the SCYA3 gene might represent a ‘hot spring’ that continually generates new 
genes by duplication and fusion. 

The human immunoglobulin γ Fc receptor IIC (FCGR2C; 1q23) gene is also 
thought to have resulted from an unequal crossover, this time between the 
FCGR2A and FCGR2B genes (1q23; Warmerdam et al., 1993); the 5’ end of the 
FCGR2C gene exhibits significant homology to the FCGR2B gene whilst its 3’ 
end is homologous to the FCGR2A gene. Finally, two human ubiquitin genes rep- 
resent the products of an in-frame fusion event between a ubiquitin gene and a 
ribosomal protein gene encoding a protein of either 52 (UBA52; 19p13; Baker 
and Board, 1991, 1992) or 76-80 (UBA80) amino acids in length (Lund et al., 
1985). 


9.3.2 Internal methionines as evidence for ancient gene fusion events 


Figure 9.6. Inferred visual pigment gene (RCP, GCP) arrangements in 26 individuals 
with normal color vision (redrawn from Neitz et al., 1995). 

Berman et al. (1994) demonstrated that some 20% of eukaryotic proteins are 
multiples of 123 ± 3 amino acids. This modular structure has been held to reflect 
the combinatorial fusion of genes, initially of the same elementary size, during 
evolution. One consequence of this fusion process would be the frequent 
occurrence of methionine-encoding ATG triplets at the borders between unit length 
sequence segments, a prediction which has been borne out by statistical analysis 
(Kolker and Trifonov, 1995). This positional preference of methionines testifies to 
the excision-reinsertion mechanism of protein construction and means that these 
internal methionines can justifiably be termed the ‘fossils of gene fusion.’ 
Further, it illustrates the probable importance of gene fusion events in the 
evolutionary construction of extant gene sequences. 
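
The prediction that internal methionines cluster at the borders between unit 
length segments can be illustrated with a minimal sketch. It assumes a unit 
length of 123 residues and a tolerance of 3, the values quoted for the size 
periodicity. The function name, the toy sequence and the 1-based numbering are 
illustrative assumptions and do not reproduce the statistical procedure of 
Kolker and Trifonov (1995). 

```python
def internal_met_near_borders(protein, unit=123, tol=3):
    """Return (position, near_border) for each internal methionine.

    Positions are 1-based. A residue counts as 'near a border' when it lies
    within `tol` residues of a non-zero multiple of `unit`.
    """
    hits = []
    for pos, residue in enumerate(protein, start=1):
        if residue != "M" or pos == 1:   # skip the initiator methionine
            continue
        offset = pos % unit
        distance = min(offset, unit - offset)
        # pos >= unit - tol excludes residues that are merely close to the
        # N-terminus (i.e. close to the zeroth multiple of the unit length).
        near = pos >= unit - tol and distance <= tol
        hits.append((pos, near))
    return hits


# Toy sequence (invented): initiator Met, then Met at residues 60 and 124.
toy = ["A"] * 250
toy[0] = toy[59] = toy[123] = "M"
toy_protein = "".join(toy)
print(internal_met_near_borders(toy_protein))
```

In this toy case the methionine at residue 124, which would open a second 
123-residue unit, is flagged as lying at a border whereas that at residue 60 is 
not. A real analysis would need to compare the observed positions against the 
distribution expected by chance. 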


9.3.3 Fusion gene polymorphisms 


Some fusion genes occur in the human genome as polymorphic variants. One 
example is the red and green visual pigment genes (RCP, GCP; Xq28). Normal 
human males with trichromatic vision typically possess one RCP gene, one, two 
or more GCP genes plus an RCP/GCP hybrid gene (Figure 9.6; Neitz et al., 1995; 
see Chapter 7 section 7.5.2, The visual pigments). In some individuals, the RCP 
genes contain a substantial amount of GCP sequence e.g. exon 4 of the GCP gene 
(individuals 23, 26, 03, 27, and 21 in Figure 9.6). Other individuals possess RCP 
genes that contain 5’ sequences derived from GCP genes (individuals 07, 09, 14, 
25, 06, 22, 20, 04, 08, 19, 18, 01 in Figure 9.6). The degree of polymorphism is dra- 
matic in that ~70% of individuals with normal trichromatic color vision possess 
one or other type of fusion gene (Figure 9.6). Other gene fusions have occurred as 
a result of unequal homologous crossing over to generate polymorphic variants in 
the MNS (GYPA; 4q31; Huang and Blumenfeld, 1991) and ABO blood group sys- 
tems (ABO; 9q; Olsson et al., 1997). 


9.3.4 Fusion splicing 


Although gene fusion can arise through deletion or translocation, it need not 
invariably occur as a result of DNA rearrangement. Gene fusion can, in functional 
terms, also be brought about by the fusion splicing of mRNA transcripts derived 
from two closely linked but unrelated genes. Evidence for such a mechanism has 
come from the cotranscription of two human genes encoding 
galactose-1-phosphate uridylyl-transferase (GALT) and interleukin-11 receptor 
α-chain (IL11RA) which are closely linked (separated by only 4 kb) on 
chromosome 9p13 (Magrangeas et al., 1998). GALT is a 43 kDa enzyme required for 
the conversion of galactose to glucose and is encoded by a gene which comprises 
11 exons spanning 4 kb and which specifies a 1.4 kb mRNA. The IL11RA gene, a 
member of the hematopoietin receptor superfamily, comprises 13 exons spanning 
8 kb and encodes a 48 kDa protein. Magrangeas et al. (1998) demonstrated that 
the two genes are sometimes cotranscribed in normal human cells and that the 
3 kb fusion mRNA encodes an 85 kDa protein of unknown function and biological 
significance. The in-frame fusion transcript probably results from a combination 
of inefficient RNA polymerase II termination of GALT gene transcription and an 
alternative splicing event between exon 10 of the GALT gene and exon 2 of the 
IL11RA gene. 

Magrangeas et al. (1998) speculated that fusion splicing could represent an 
‘exploratory event for evolution’; fusion proteins thus formed might initially be 
produced at low cellular levels but would represent a reservoir of novel functions 
which could be called upon to confer a selective advantage under particular con- 
ditions. In such a situation, it may be envisaged that a point mutation at the 
acceptor splice site or polyadenylation site could then lead to the generation of 
much larger amounts of the fusion protein. Such a mutation is evident in the 
guinea pig gene that encodes seminal vesicle secretory proteins 1, 3, and 4 
(Hagstrom et al., 1996). The 5’ half of this guinea pig gene is homologous to the 
human semenogelin II (SEMG2; 20q12-q13) gene. Indeed, sequences related to 
the human SEMG2 gene are also found in the first intron of the guinea pig gene 
(Hagstrom et al., 1996). However, the 3’ half of the guinea pig gene shares 
homology with the closely linked skin-derived antileukoproteinase/elafin (PI3; 
20q12-q13) gene. It would appear that, as a result of an A→G transition, the 
guinea pig gene contains a novel AG dinucleotide 7 bp upstream of the 3’ splice 
site used in the SEMG2 gene. If used as a splice acceptor site, this AG 
dinucleotide would lead to a spliced product that is out of frame with the 
product of the SEMG2 gene, and the use of a splice acceptor site in the 
downstream PI3 gene would have been favored (Hagstrom et al., 1996). Thus an 
alternative splicing pathway has led to 
the creation of a novel guinea pig gene that presumably encodes a protein with a 
new function. 
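
The reading-frame argument in the guinea pig example is a matter of simple 
arithmetic: an acceptor site that lengthens or shortens an exon by a number of 
nucleotides not divisible by three shifts the downstream reading frame. The 
sketch below assumes, purely for illustration, that use of the novel AG would 
lengthen the exon by 7 nucleotides; the exact figure depends on how the 7 bp 
distance is measured, and the function names are hypothetical. 

```python
def frame_shift(extra_nt):
    """Return the reading-frame offset (0, 1 or 2) caused by extra_nt added nucleotides."""
    return extra_nt % 3


def is_in_frame(extra_nt):
    """True when the downstream codons are still read in the original frame."""
    return frame_shift(extra_nt) == 0


# Assumed 7-nt extension from the upstream AG: the frame is shifted.
print(is_in_frame(7))
# A 6-nt extension, by contrast, would have preserved the frame.
print(is_in_frame(6))
```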

In principle, fusion splicing could also lead to the creation of novel genes 
through the germline retrotransposition of the fusion mRNA. Such novel fusion 
genes would however be expected to be characterized by the absence of introns 
(unlike the case of the guinea pig seminal vesicle secretory protein gene cited 
above) and to be chromosomally distant from the parental genes. There are at 
least two other possible ways of generating novel genes by fusion splicing. The 
first would be via trans-splicing (involving the joining of independently tran- 
scribed coding sequences from genetically unlinked loci) followed by retrotrans- 
position back into the genome. As yet, however, the available evidence for 
trans-splicing in mammalian cells must be regarded as somewhat tentative (Eul et 
al., 1995; Fujieda et al., 1996). The second mechanism would involve the retro- 
transposition of abnormally spliced gene transcripts comprising exons joined in 
an order different from that in which they are normally found in the genome 
(scrambled exons). Exon scrambling has been described for at least two human 
genes: DCC (18q21; Nigro et al., 1991) and MLL (11q23; Caldas et al., 1998). 


9.4 Recombination 


Mankind...will not willingly admit that its destiny can be revealed by the 
breeding of flies or the counting of chiasmata. 
C. D. Darlington (1960) 


9.4.1 Homologous recombination 


Numerous recombination events have now been documented as having occurred 
during the evolution of mammalian genes. Indeed, most gene duplication and 
amplification events have been mediated by recombination (see Chapter 8, 
section 8.5). Examples of the emergence of novel human gene sequences by 
recombination-mediated duplication include those encoding saposin C (PSAP; 
10q; Rorman et al., 1992), pepsinogen A3 (PGA3; 11q13; Evers et al., 1989), 
mannose-binding protein (MBP; 10q11-q21; Sastry et al., 1989) and the α-amylase 
genes (AMY1A, AMY1B, AMY1C, AMY2A, AMY2B; 1p21; Groot et al., 1990). 
Recombination events have also been invoked to explain cases of gene fusion (see 
Section 9.3). 

Multiple independent recombination events are thought to have occurred dur- 
ing the evolution of the two human haptoglobin genes (HP and HPR; 16q22.1) 
and their counterparts in the other primates (Erickson et al., 1992; Erickson and 
Maeda, 1994; Maeda et al., 1986; Maeda, 1985; McEvoy and Maeda, 1988). Whilst 
the great apes and Old World monkeys possess three haptoglobin genes (Hp, Hpr, 
and Hpp), New World monkeys only have one (Hp). This is consistent with a gene 
triplication after the divergence of Old World from New World monkeys followed 
by a Hpp gene deletion in the human lineage. Breakpoint analysis suggests that 
both duplications and deletions of gene copies may have been mediated by 
homologous unequal recombination between Alu repeats. That this region is 
prone to recombination is also evidenced by the haptoglobin gene copy number 
polymorphism found in the black population (Maeda et al., 1986). Other examples 
of homologous recombination events thought to be responsible for human gene 
copy number polymorphisms include the α-globin (HBA1; Lie-Injo et al., 1981) 
genes, the ζ-globin (HBZ; Winichagoon et al., 1982) gene, the α-amylase genes 
(Groot et al., 1989), the proline-rich protein genes (PRB1, PRB2, PRB3, PRB4; 
12p13.2; Lyons et al., 1988) and the pepsinogen A genes (PGA3, PGA4, PGA5; 
11q13; Zelle et al., 1988); none of these copy number polymorphisms are known 
to have any clinical significance. 

Recombination has also resulted in the alteration of the expression level of a 
gene. Thus, the galago δ-globin gene is expressed at an unusually high level in the 
adult (18% of β-like globin chains, cf. 0-6% in other primates) as a result of a 
recombination event which replaced 2.4 kb of δ-globin gene sequence by β-globin 
gene sequence containing 800 bp of the promoter region (Tagle et al., 1991). 

Considerable effort has been made to localize and characterize the DNA 
sequences responsible for mediating the recombinational events known to have 
occurred during gene evolution. Thus, the duplication events involved in the evo- 
lution of the human glycophorin (GYPA, GYPB, and GYPE; 4q28.2-q31.1) genes 
(Labuda et al., 1995; Figure 9.7), the human growth hormone gene (GH1, GH2, 
CSH1, CSH2; 17q22-q24; Figure 4.13) cluster (Chen et al., 1989) and the mouse 
lysozyme genes (Cross and Renkawitz, 1990) are considered to have been 
mediated by recombination between Alu repeats. A correspondence between 
illegitimate recombination junctions and the sites of Alu sequence insertion has 
also been proposed for the primate α-globin (HBA1, HBA2; 16p13.3) genes (Bailey 
et al., 1997; Shaw et al., 1991). The duplication of the primate γ-globin (HBG1; 
11p15.5) gene may have been mediated by homologous recombination between 
LINE elements flanking a fetal globin progenitor gene (Fitch et al., 1991). High 
G+C content may promote recombination (Eyre-Walker, 1993) whilst Ashley 
et al. (1993) have suggested that telomeric repeats could promote meiotic 
recombination. 
Figure 9.7. Three-step scenario in the evolution of the glycophorin gene family. 

A: GYPA, B: GYPB, E: GYPE. In the second step, unequal Alu-Alu recombination 
created the subsequently duplicated B/E gene locus. Vertical triangles denote Alu repeats, 
open boxes the genes A, A’, B/E, B, and E, and the filled box a ‘precursor’ sequence (P) 
that became an integral part of the contemporary genes B and E (redrawn from Labuda et 
al., 1995). 


High resolution genetic mapping studies have provided evidence for both 
‘hotspots’ and ‘coldspots’ of recombination (Nagaraja et al., 1997; Shiroishi et al., 
1993). A frequent disease-associated recombinational hotspot in the human 
genome occurs at 17p11.2-p12 where a CMT1A-REP tandem repeat mediates 
meiotic crossing over events through chromosome misalignment. Humans and 
chimpanzees have two copies of this repeat whilst gorilla, orangutan and gibbon 
have only a single copy (Kiyosawa and Chance, 1996). The duplication of the 
CMT1A-REP repeat must therefore have occurred after the divergence of the gorilla 
lineage but before the divergence of chimpanzee and human. 
The repeat contains a mariner-like element in the vicinity of the recombination 
hotspot and, since this element occurs in association with the CMT1A-REP 
repeat in all primates, it must have predated the emergence of the proximal and 
distal copies of the repeat in the human-chimpanzee common ancestor (Kiyosawa 
and Chance, 1996). Interestingly, the mariner element of Drosophila exhibits 
sequence homologies to transposons from Caenorhabditis elegans as well as hep- 
tamer and nonamer signal sequences of the vertebrate immunoglobulin somatic 
recombination pathway (Dreyfus, 1992). Whether these homologies that transcend 
the invertebrate-vertebrate divide are merely coincidental, or whether they 
reflect either a common evolutionary origin or alternatively convergent 
evolution, is as yet unclear. 

Recombination may have been involved in at least some instances of ‘exon 
shuffling’, the process of intron-mediated recombination by which functional 
domains encoded by one or more exons are dispersed to a variety of different 
proteins (reviewed in more detail in Chapter 3, section 3.6). For example, a com- 
bination of X-ray crystallographic data and studies of exon organization have led 
to the conclusion that the two exons encoding the nucleotide-binding domain 
have been independently transferred to the human phosphoglycerate kinase 
(PGK1; Xq13.3) gene as well as to the maize alcohol dehydrogenase and chicken 
glyceraldehyde-3-phosphate dehydrogenase genes (Michelson et al., 1985). Whilst 
the almost ubiquitous presence of introns may permit exon shuffling by intron- 
mediated recombination (see Chapter 3, section 3.6), it has been suggested 
(Matsuo et al., 1994) that the presence of short unconserved introns with different 
types of splice junction may serve to prevent recombination between exon clus- 
ters encoding highly conserved functional domains without inhibiting the poten- 
tial for domain (multiexon) shuffling. 


9.4.2 V(D)J recombination 


It is natural selection that gives direction to changes, orients chance, and 
slowly, progressively produces more complex structures, new organs, and new 
species. Novelties come from previously unseen association of old material. To 
create is to recombine. 

F. Jacob (1977) 


In the lymphocyte, functional immunoglobulin and T-cell receptor genes are 
assembled from their constituent gene coding segments by V(D)J recombination 
(Gellert, 1997; see Chapter 4, section 4.2.4, Immunoglobulin genes and T-cell receptor 
genes). Recombination is directed by recombination signal sequences (RSS) adja- 
cent to each coding sequence. The RSS comprises conserved heptamer and non- 
amer motifs separated by a non-conserved spacer region of either 12 bp or 23 bp 
(termed 12 signals and 23 signals); recombination only occurs between a 12 signal 
and a 23 signal. V(D)J recombination is initiated by the products of two 
lymphoid-specific intronless recombination-activating genes, RAG1 and RAG2, 
closely linked to each other on human chromosome 11p13. The RAG1 and RAG2 
proteins bind the two recombination signals, bringing them into close 
juxtaposition and cleaving the DNA molecule thereby separating the recombination 
signals from the flanking coding segments. The RAG1 and RAG2 genes are present 
in all jawed vertebrates but have not been reported either in jawless 
vertebrates or invertebrates. 
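
The pairing restriction on recombination signals (often referred to as the 
12/23 rule) can be expressed as a very small check. The sketch below is 
illustrative only: the class name, the segment labels and the spacer 
assignments are hypothetical, and no heptamer or nonamer sequence is modelled. 

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RSS:
    """A recombination signal sequence, described only by its spacer length."""
    segment: str      # label of the flanked coding segment (illustrative)
    spacer_bp: int    # 12 or 23


def can_recombine(a, b):
    """Apply the pairing rule: one signal must be a 12 signal, the other a 23 signal."""
    return {a.spacer_bp, b.spacer_bp} == {12, 23}


# Hypothetical segments with arbitrarily assigned spacers.
seg_a = RSS("segment A", 12)
seg_b = RSS("segment B", 23)
seg_c = RSS("segment C", 12)

print(can_recombine(seg_a, seg_b))   # 12 signal with 23 signal
print(can_recombine(seg_a, seg_c))   # two 12 signals
```

Comparing the two spacer lengths as a set makes the check independent of the 
order in which the signals are supplied. 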

Together RAG1 and RAG2 constitute a transposase that is capable of excising a 
DNA segment containing recombination signals from a donor site and inserting 
it into a target DNA molecule (Agrawal et al., 1998; Hiom et al., 1998). The prod- 
uct of transposition contains a short target site duplication immediately flanking 
the transposed fragment that is reminiscent both of retroviral integration and the 
insertional mechanisms employed by other transposases. It is therefore 
reasonable to speculate that the RAG1 and RAG2 genes may once have been contained 
within a transposable element that also possessed flanking RSS sequences (Figure 
9.8). Since in extant genomes, the RAG genes and the RSS elements are unlinked, 
transposition now serves to relocate pairs of RSS elements plus the intervening 
sequence without the necessity of the RAG genes themselves being displaced. 

Figure 9.8. Model for the evolution of the immunoglobulin and T-cell receptor (TCR) 
genes and the role of RAG-mediated transposition (from Agrawal et al., 1998). 

(a) Putative structure of the recombination activating gene (RAG) transposon that 
integrated into the germline of a vertebrate ancestor. Arrows denote direction of 
transcript. (b) The split organization of extant immunoglobulin and TCR genes may have 
arisen by RAG-mediated transposition through the introduction of signal end/signal end 
elements into a primordial receptor gene exon, thereby dividing the exon into two or 
three gene segments, each flanked by one or two recombination signals (triangles). These 
gene segments would be the evolutionary precursors of current V, D, and J gene segments. 
Different patterns of gene duplication could have given rise to both the mammalian 
heavy chain locus and the ‘cluster’ configuration in cartilaginous fishes. 

The split nature of the vertebrate immunoglobulin light chain and T-cell 
receptor α and γ genes could have originated from the germline insertion of the RAG 
transposon into an ancestral receptor gene soon after the divergence of jawed and 
jawless fishes (Agrawal et al., 1998; Hiom et al., 1998; Thompson, 1995; Figure 
9.8). This gene could then only have been expressed if the inserted transposon 
were excised by the RAG proteins and the two ends of the exon rejoined and 
repaired. The tripartite structure characteristic of the immunoglobulin heavy 
chain and T-cell receptor β and δ genes could have arisen as a result of the further 
insertion of a second RAG transposon into the same exon. Subsequent duplica- 
tion of the individual gene segments or of the entire gene would then have served 
to generate the ‘mammalian-type’ or ‘cluster-type’ organization of mammals and 
cartilaginous fishes, respectively (Figure 9.8; Litman et al., 1993). It therefore 
appears likely that our early vertebrate ancestors were successful in taming a 
transposon by transforming it into a site-specific recombinase that was then har- 
nessed in the cause of generating the genetic diversity so vital for the flexible 
adaptive response of our immune system. 
9.5 Gene conversion 


Gene conversion is the ‘modification of one of two alleles by the other’ (Vogel and 
Motulsky, 1997). The end result is very similar to that resulting from a double 
unequal crossing-over event and in humans, it is difficult to distinguish the two 
processes (Cooper and Krawczak, 1993). However, in practical terms, the differ- 
ence is that the correction of an ‘acceptor’ gene or DNA sequence by gene con- 
version is nonreciprocal leaving the ‘donor’ sequence physically unchanged. 

The process of gene conversion may involve the whole or only a part of a gene 
(Dover, 1993) and usually occurs between highly homologous, but nonallelic 
genes. Gene conversion may be suspected in cases where the degree of sequence 
homogeneity exhibited by related genes is much greater than expected from what 
is known of their evolutionary history. Thus, two or more gene sequences may 
exhibit strong homology over all or parts of their coding or promoter sequences 
yet they are known from phylogenetic studies to have diverged a considerable 
time ago and might reasonably have been expected to have accumulated signifi- 
cant numbers of both synonymous and non-synonymous substitutions. 
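
This line of reasoning can be sketched as a simple comparison of sequence 
identities. The example assumes that the duplication predates the species 
split, so that, in the absence of conversion, a gene should resemble its 
orthologue in another species at least as closely as it resembles its 
paralogue in the same genome. The function names and the toy sequences are 
invented for illustration, and a real analysis would require proper alignment 
and a statistical test. 

```python
def percent_identity(seq_a, seq_b):
    """Percentage of identical positions between two pre-aligned, equal-length sequences."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    if not seq_a:
        raise ValueError("sequences must not be empty")
    matches = sum(x == y for x, y in zip(seq_a, seq_b))
    return 100.0 * matches / len(seq_a)


def conversion_suspected(paralogue_1, paralogue_2, orthologue_1):
    """Flag possible gene conversion between two paralogues.

    Returns True when the paralogues are more similar to each other than
    paralogue_1 is to its orthologue in another species.
    """
    within = percent_identity(paralogue_1, paralogue_2)
    between = percent_identity(paralogue_1, orthologue_1)
    return within > between


# Toy aligned sequences (invented for illustration).
human_copy_1 = "ATGGCCAAGTTGACC"
human_copy_2 = "ATGGCCAAGTTGACT"           # 1 difference from copy 1
other_species_copy_1 = "ATGGCTAAGCTGACA"   # 3 differences from copy 1
print(conversion_suspected(human_copy_1, human_copy_2, other_species_copy_1))
```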

There are now numerous examples of gene conversion causing human gene 
pathology (Cooper and Krawczak, 1993; see Chapter 6, section 6.1.6) but gene 
conversion has also had an important influence in evolution. Examples of gene 
conversion in human gene evolution often involve duplicated gene sequences 
lying in close physical proximity: the Gγ (HBG2; 11p15.5) and Aγ (HBG1; 
11p15.5) globin genes (Shen et al., 1981; Scott et al., 1984; Stoeckert et al., 1984; 
Powers and Smithies, 1986; Fitch et al., 1990; see Figure 9.9), the β- and δ-globin 
genes (HBB, HBD; 11p15.5; Koop et al., 1989), the α1- (HBA1) and α2- (HBA2) 
globin genes (Liebhaber et al., 1981), the HLA-DQB1 loci (6p21.3; Wu et al., 
1986), the visual pigment (GCP and RCP; Xq28; Kuma et al., 1988; Shyue et al., 
1994, 1995; Winderickx et al., 1993; Zhou and Lee, 1996) genes, the rhesus blood 
group (RHD, RHCE; 1p34-p36.2; Carritt et al., 1997) genes, the salivary 
(AMY1A, AMY1B, AMY1C; 1p21) and pancreatic (AMY2A and AMY2B; 1p21) 
amylase genes (Gumucio et al., 1988), the α-interferon genes IFNA1 and IFNA13 
(9p22; Todokoro et al., 1984), the haptoglobin genes (HP, HPR; 16q22.1; 
Erickson and Maeda, 1994; Maeda, 1985), the p15 and p16 cyclin-dependent 
kinase inhibitor (CDKN2B and CDKN2A; 9p21) genes (Jiang et al., 1995), the 
glycophorin genes GYPA and GYPE (4q28.2-q31.1; Kudo and Fukuda, 1994; 
Onda and Fukuda, 1995), the P glycoprotein 1 and 3 genes (PGY1, PGY3; 7q21.1; 
van der Bliek et al., 1988), the salivary proline-rich protein genes (PRB1, PRB2, 
PRB3, PRB4; 12p13.2; Kim et al., 1993), the cardiac α- and β-myosin heavy 
chain (MYH6, MYH7; 14q12; Epp et al., 1995) genes, the chorionic 
somatomammotropin (CSH1, CSH2; 17q22-q24) genes (Hirt et al., 1987), the 
immunoglobulin Cα (IGHA1 and IGHA2; 14q32; Kawamura et al., 1992) and VH genes 
(IGHV; 14q32; Haino et al., 1994), and the α1-acid glycoprotein (ORM1, ORM2; 
9q34.1-34.3) genes (Merritt et al., 1990). 

In the context of gene conversion, if there is any middle ground between 
pathology and evolution, it is probably occupied by the cytochrome P450 
CYP2A6 gene (19q13.2). A null CYP2A6 variant, formed by gene conversion 
between the CYP2A6 gene and the closely linked but inactive CYP2A7 gene, 
occurs at polymorphic frequencies in the Japanese (28%) and African-American 
(2.5%) populations (Fernandez-Salguero et al., 1995). 

Figure 9.9. History of γ-globin gene evolution and conversion events in the great apes 
(from Scott et al., 1984). A duplication of a γ-globin gene encoding glycine at position 136 
occurred in the early catarrhine primates about 35 Myrs ago. Since a glycine is encoded at 
this position in both γ-globin genes of Pongo and Old World monkeys whereas a 
replacement with alanine in the 3’ gene is found in Homo and Gorilla, this change must 
have occurred after the divergence of Pongo (about 8 Myrs ago) but before Homo and 
Gorilla branched off (about 6-7 Myrs ago). This replacement may have occurred before 
the first conversion (C1), or it may have happened after it but still before the separation 
of Homo and Gorilla. If the Gly→Ala replacement occurred before C1, the 3’ boundary of 
C1 can be placed in exon 3 at codon position 135 or nucleotide position 1543. If the 
replacement occurred after C1, the 3’ boundary can be placed at the start of the 3’ 
untranslated region. C1 is common to both humans and gorillas and was estimated to 
have occurred about 10 Myrs ago, using a replacement rate change of 1% for every 10 
Myrs. 

No further conversions have been identified in the gorilla lineage, but three have been 
identified in the human lineage. C2 and C3 are estimated to have occurred about 2-3 
Myrs ago either in a common ancestor of human chromosome types A and B or in an 
early version of chromosome B itself. Conversion C2 is evident in the BAγ-gene from 
positions 901 to 1128 and extending into the ‘hot spot,’ and C3 is also evident in the 
BAγ-gene, but from positions 42 to 777. C4 extending over some 1500 bp is estimated to have 
occurred no earlier than 1 Myr ago on human chromosome type A. Its effects are evident 
in the AAγ-gene from positions 42 to 1128 with its 3’ boundary being located in the hot 
spot region. Taken from Scott et al. (1984) with kind permission. 

Presumably, the greater the number of homologous repetitive elements, the 
greater the opportunity for gene conversion. Consistent with this assertion, multi- 
gene families frequently exhibit extensive sequence homogeneity compatible 
with the consequences of gene conversion. Thus, the members of the human 
ribosomal RNA gene family, ~400 in number and arranged in tandem repeats on five 
different chromosomes (RNR1, 13p12; RNR2, 14p12; RNR3, 15p12; RNR4, 
21p12; RNR5, 22p12), are much more similar to one another than they are to 
members of the rDNA family in other primates (Arnheim et al., 1980; Li, 1997). 
This genomic organization also results in a greater degree of homogeneity within 
rDNA clusters than between them (Gonzalez et al., 1992; Seperack et al., 1988), a 
phenomenon also noted for the polyubiquitin genes (UBA52, 19p13.1; UBB, 
17p11.1-p12; UBC, 12q24.3) where gene conversion appears to occur exclusively 
within rather than between gene clusters (Sharp and Li, 1987). Other examples of 
repetitive gene families being subject to the homogenizing effects of gene 
conversion are those of the human immunoglobulin Cα (IGHA1; 14q32-q33; Kawamura 
et al., 1992; McCormack et al., 1993), Vκ light chain (IGKV; 2p12; Huber et al., 
1993), γ heavy chain (IGHG1; 14q32.33; Lefranc et al., 1986) and λ light chain 
(IGLC1; 22q11.12; Udey and Blomberg, 1988) genes, U2 snRNA (RNU2; 17q21-q22; 
Liao et al., 1997) genes and the T-lymphocyte antigen receptor β (TCRB; 
14q11.2; Funkhouser et al., 1997; Tunnacliffe et al., 1985) genes. 

The pituitary-expressed growth hormone (GH1; 17q22-q24) gene promoter 
region has been found to exhibit a very high level of sequence polymorphism with 
8 variant nucleotides within a 134 bp stretch (Giordano et al., 1997). These eight 
variable positions have been ascribed to 12 different haplotypes ranging in fre- 
quency from 2% to 31% in the general population. Since they occur in the same 
positions in which the GH1 gene differs from the other, placentally expressed, 
genes of the growth hormone cluster [two chorionic somatomammotropin (CSH1 
and CSH2) genes, a chorionic somatomammotropin ‘pseudogene’ (CSHL1) and a 
second growth hormone gene (GH2)], the mechanism responsible is likely to be 
gene conversion with the placentally expressed genes serving as donors of the 
converted sequences. Various examples of gene conversion of functional genes 
templated by pseudogenes have been documented as a cause of human pathology 
(discussed in Chapter 6 section 6.1.6). 

Studies of gene conversion in fungi have shown that gene conversion occurs not 
only between alleles but also intrachromosomally between duplicated sequences 
on either the same chromatid or the sister chromatid and even between sequences 
on nonhomologous chromosomes. Does gene conversion in the human genome 
occur predominantly within a single chromosome (i.e. through sister chromatid 
exchanges) or instead between either homologous or nonhomologous chromo- 
somes? Data from both the human ribosomal RNA (Seperack et al., 1988) and U2 
snRNA gene families (Liao et al., 1997) have suggested that intra-chromosomal 
events are more frequent than inter-chromosomal events. Inter-chromosomal gene 
conversion events have been reported in human gene pathology (see Chapter 6, 
section 6.1.6) but these are comparatively rare. In an evolutionary context, 
possible examples of gene conversion operating between homologous loci on 
nonhomologous chromosomes are provided by non-reciprocal exchanges between 
the X-linked and autosomal phosphoglycerate kinase (PGK1, Xq13; PGK2; chro- 
mosome 19) genes and the X-linked and autosomal pyruvate dehydrogenase 
(PDHA1, Xp22.1-p22.2; PDHA2; 4q22-q23) genes (Fitzgerald et al., 1996). 

Although gene conversion is generally considered to involve only short 
stretches of DNA sequence, it has often not been possible to determine the precise 
length of the converted sequence owing to the high degree of homology manifested 
by the genes involved. Papadakis and Patrinos (1999) reviewed data from 
the human Gγ- (HBG2; 11p15.5) and Aγ- (HBG1; 11p15.5) globin genes and concluded 
that the length of the converted fragments is usually less than 400 bp but 
can vary from as little as 113 bp to as much as 2266 bp. 

The mechanism underlying gene conversion remains elusive but must presum- 
ably entail the close physical interaction between homologous DNA sequences. It 
may involve heteroduplex formation followed by mismatch repair. Both Amor et 
al. (1988) and Matsuno et al. (1992) have noted the presence of Chi-like sequences 
(GCTGGGG; known to promote recombination both in E. coli and in mouse 
immunoglobulin genes; Smith, 1983) in the vicinity of regions of the 
CYP21/CYP21P and HBB genes. These authors speculated that the Chi-like 
sequences might play a role in gene conversion events. Various other sequences 
e.g. Alu repeat sequences (Merritt et al., 1990), a retroviral Long Terminal Repeat 
(Pavelitz et al., 1995), alternating purine-pyrimidine tracts (Papadakis and 
Patrinos, 1999) and a (CT)n-(GA)n microsatellite (Liao and Weiner, 1995) have 
also been postulated to be involved in promoting gene conversion. Many gene 
conversion events involving the HBG1 and HBG2 genes appear to terminate near 
a polypyrimidine stretch (Fitch et al., 1990). Papadakis and Patrinos (1999) suggested 
that palindromic sequences may be associated with the termination of gene 
conversion, possibly through secondary structure formation blocking branch 
migration. The insertion of some transposable elements into members of multi- 
gene families can however reduce the rate, and limit the extent of, gene conver- 
sion by reducing the degree of homology between potential donor and acceptor 
sequences (Hess et al., 1983; Schimenti and Duncan, 1984). 

Gene conversion has thus had an important influence on the evolution of 
multigene families by homogenizing the sequences of duplicated or repeated 
genes. Sequence homogenization can of course hinder divergence and hence 
potential adaptation. New substitutions occurring in a sequence will tend to be 
lost because they are likely to be converted back to the sequence of the more com- 
mon allele in the population. However, gene conversion may also promote diver- 
sity by introducing multiple sequence changes simultaneously into different 
members of large gene families as with the genes encoding the proteins of the 
major histocompatibility complex (HLA-B, HLA-DRA; 6p21.3) (Belich et al., 
1992; Gorski and Mach, 1986; Kuhner and Peterson, 1992; Kuhner et al., 1991; 
Ohta, 1991; Parham and Lawlor, 1991; Seemann et al., 1986) and the kallikreins 
(Ohta and Basten, 1992). 


9.6 Gene creation by retrotransposition 


A mutation that produces a new elementary species is due to the sudden 
appearance or creation of a new element — a new gene. Put in another way, we 
witness at mutation the birth of a new gene or at least its activation. The number 
of active genes in the world has been increased by one. 
T.H. Morgan (1926) The Theory of the Gene 


During the evolution of the mammalian genome, new genes have often emerged 
by duplication and divergence (Chapter 4 and Section 9.5). Another, albeit less 
common route has been via retrotransposition (Brosius, 1991; Nouvel, 1994). At 
least five functional human genes have a structure consistent with their having 
arisen by retrotransposition: the phosphoglycerate kinase 2 (PGK2; 19) gene 
(Boer et al., 1987; McCarrey and Thomas, 1987; McCarrey, 1990), the pyruvate 
dehydrogenase α2 (PDHA2; 4q22-q23) gene (Dahl et al., 1990), the cAMP-dependent 
protein kinase Cγ subunit (PRKACG; 9q13) gene (Reinton et al., 1998), the 
SR splicing factor SRp46 (Srp46; 11q22; Soret et al., 1998) gene and the Y-encoded 
zinc finger protein (ZFY; Yp11.3) gene (Ashworth et al., 1990). All are characterized 
by a lack of introns and a chromosomal location distinct from 
their presumed parent genes (PGK1, Xq13; PDHA1, Xp22.1-p22.2; PRKACA, 
19p13.1; ZFX, Xp21.3-p22.2 respectively). In the case of the PRKACG gene, 
there are also remnants of flanking direct repeats and a poly(A) tail providing fur- 
ther evidence of the retrotranspositional origin of the gene (Reinton et al., 1998). 
This type of sequence has sometimes been referred to as a ‘functional retro- 
pseudogene’ but this is surely a misnomer. Functional it may be, but pseudogene 
it is not, and so a better term would be ‘retrotransposed gene’. 

The retrotransposed PGK2 gene originated prior to the divergence of eutherian 
and metatherian mammals some 125 Myrs ago and has been characterized in 
some detail (McCarrey, 1990). It is expressed only in spermatogenic cells whereas 
its putative parent (PGK1) is ubiquitously expressed. Sequence comparison studies 
are consistent with the newly retrotransposed PGK2 gene having possessed 
regulatory sequences derived from the PGK1 gene which facilitated the initial 
expression of the intronless gene. Inclusion of promoter sequence could have 
come about if the PGK1-derived PGK2 mRNA intermediate had initiated aber- 
rantly at an upstream start site as McCarrey (1990) suggested, or instead from an 
alternative upstream promoter. This promoter sequence might then be envisaged 
to have evolved cell type specificity, a process presumably involving loss of the 
CpG island still apparent in the PGK1 gene. Interestingly, the PDHA2 and 
PRKACG genes are also expressed exclusively in the spermatogenic cells of the 
testis. In the case of the X-linked PGK1 and PDHA1 genes, transposition to an 
autosome might have been selectively advantageous in that it would have ensured 
expression of these important enzyme-encoding genes in Y-bearing as well as 
X-bearing spermatozoa. 


10 


Molecular reconstruction 
of ancient genes/proteins 


10.1 Introduction 


Proteins from extinct organisms can be studied by the analysis of DNA recovered 
from preserved organic specimens. This approach, however, requires biological 
material that has been completely protected from the oxidative effects of oxygen 
and water (Audic and Beraud-Colomb, 1997; DeSalle et al., 1992; Pääbo, 1993; 
reviewed by Li, 1997) and significant doubts may remain as to the authenticity of 
the DNA recovered (Austin et al., 1997; Stoneking, 1995). Alternatively, an 
attempt can be made to reconstruct the amino acid sequence of an ancient protein 
from the sequences of its extant descendants (Malcolm et al., 1990; Shih et al., 
1993; Stackhouse et al., 1990). This methodology relies upon the principle of max- 
imum parsimony (i.e. the assumption that extant proteins have evolved from that 
of an extinct ancestor by the smallest number of mutational changes). DNA 
sequence data can also be treated in the same way (Hillis et al., 1993), one example 
being the investigation of ribosomal DNA phylogeny (Friedrich and Tautz, 1995). 

Jermann et al. (1995) examined the evolutionary history of artiodactyl RNase A, 
a pancreatic digestive enzyme, derived the ancestral enzymes phylogenetically 
and reconstructed them by site-directed mutagenesis. The reconstructed enzymes 
were found to have kinetic properties similar to those of extant RNases but to be 
less stable to thermal denaturation, more susceptible to proteolysis and five-fold 
more active toward double-stranded RNA. A Gly38→Asp substitution, which 
occurred ~40 Myrs ago around the time when foregut rumination evolved, was 
found to be associated with the reduction in activity toward double-stranded 
RNA. 

A similar analysis has been attempted with the chymases (mast cell proteases) 
which hydrolyse angiotensin I to generate angiotensin II, a potent vasoconstrictor 
hormone. Primate α-chymases are highly specific and only hydrolyse the Phe8-His9 
peptide bond. By contrast, rat β-chymase is less specific and further hydrolyses 
angiotensin I by cleavage of the Tyr4-Ile5 bond. Chandrasekharan et al. 
(1996) determined the phylogeny of four mammalian α-chymases and six mammalian 
β-chymases and reconstructed the putative ancestral enzyme by chemical 
synthesis. This enzyme proved capable of cleaving angiotensin I at the Phe8-His9 
bond but not at the Tyr4-Ile5 bond. Thus the relatively narrow specificity of the 
primate α-chymase represents the ancestral state of the enzyme while the broader 
specificity of the rat β-chymase must have been secondarily derived. 


10.2 Molecular reconstruction and homology modelling of 
the catalytic domain of the common ancestor of the 
hemostatic vitamin K-dependent serine proteases 


Adopting the maximum parsimony principle and employing a novel cDNA-based 
strategy, Krawczak et al. (1996) reconstructed the catalytic domains of the early 
mammalian ancestors of the vitamin K-dependent factors and then went on to 
reconstruct the putative common ancestor of all five proteins from an earlier stage 
of vertebrate evolution. A tertiary model of the ancestral vitamin K-dependent 
serine protease was built that was both energetically satisfactory and possessed a 
credible fold, and its topological and biophysical properties were examined. 
Wacey et al. (1997) then traced the evolution of specific structural features in pro- 
tein C from the common ancestor of the vitamin K-dependent serine proteases 
toward extant human protein C. These studies will now be described in some 
detail since they serve to illustrate the potential value of homology modelling in 
evolutionary studies. 


10.2.1 The evolution of the vitamin K-dependent coagulation factors 


The vitamin K-dependent serine proteases of coagulation (factors VII, IX, and X, 
prothrombin and protein C) exhibit substantial sequence and structural homology 
(Greer, 1990). Factors VII, IX, X and protein C all contain an N-terminal 
domain of γ-carboxyglutamic acid (Gla) residues, two epidermal growth factor-like (EGF) 
domains and a catalytic domain (Blake et al., 1987); prothrombin differs slightly 
in that it possesses two kringle domains instead of the EGF domains (Gojobori 
and Ikeo, 1994). With the exception of prothrombin, the genes encoding the vita- 
min K-dependent coagulation factors have a very similar exon/intron organiza- 
tion (Tuddenham and Cooper, 1994), suggesting that they have arisen from a 
relatively recent common ancestor through a process of gene duplication and 
divergence (Neurath, 1984; Patthy, 1985; Patthy, 1990). The organization of these 
genes also reflects the functional modular assembly of the respective proteins and 
is thought to have emerged through exon shuffling (Chapter 3 section 3.6). 
Evolution of these genes has proceeded by repeated insertions, duplications, 
exchanges and deletions of modules. The presence/absence of modules such as the 
calcium-binding Gla domains, the EGF-like domains and kringles was used by 
Patthy (1990) to reconstruct the evolutionary past of the genes (Figure 10.1). More 
recently, it has been realised, however, that each module or domain has its own 
distinctive evolutionary history as a result of different evolutionary pressures 
(Ikeo et al., 1995). 

Doolittle (1993) also proposed a tentative scheme for the evolution of the vitamin 
K-dependent coagulation factors (Figure 10.1). First, an ancestral prothrombin 
emerged by serine protease gene duplication and the acquisition of Gla and 
EGF domains. Prothrombin’s EGF domain(s) served as a site for the binding of 
tissue factor which at this time possessed the ability to activate it to yield thrombin. 
After the emergence of fibrinogen, factor X appeared as a result of a prothrombin 
gene duplication. The ability of factor X to activate prothrombin 
released the latter from its dependence on tissue factor. Factor VII, duplicated 
from factor X, was able to bind tissue factor and to activate factor X. Factor IX 
emerged last, again duplicated from factor X. Prothrombin then acquired kringle 
domains via exon shuffling (Rogers, 1985) allowing it to bind fibrin. The concomitant 
loss of prothrombin’s EGF domains abolished its now redundant interaction 
with tissue factor. The plausibility of this scheme was tested by Krawczak 
et al. (1996) whose approach is described in subsequent sections. 


Figure 10.1. Proposed phylogenies for the vitamin K-dependent factors of coagulation. A: 
Krawczak et al. (1996); B: Doolittle (1993); C: Doolittle and Feng (1987); D: Patthy 
(1990). 


10.2.2 Reconstruction of mammalian ancestral cDNAs 


Krawczak et al. (1996) reconstructed the catalytic domains of the early mam- 
malian ancestors of the vitamin K-dependent factors and the common ancestor of 
all five proteins. The study of Krawczak et al. (1996) relied upon mammalian phylogenies 
from different sources (Nei, 1987; Novacek, 1992; Vogel and Motulsky, 
1986) and included cDNA sequences from human, macaque, sheep, pig, rabbit, rat, 
mouse, dog, and cow. Conserved cDNA blocks were used as anchors to align inter- 
vening non-conserved cDNA blocks for a particular protein, and since protein 
function must have been retained during evolution, frameshift mutations were 
precluded in all alignments. For each protein, the authors then deduced from the 
alignments the most likely cDNA sequence at each node of the mammalian phylo- 
genies employed, including the roots representing the respective mammalian 
ancestors. A small number of ambiguities remained as to the ancestral mammalian 
cDNA sequences. These were resolved by reference to relative nearest neighbor- 
dependent mutation rates in humans. Use of human-derived parameters in this 
context was justified on the basis of the long-term evolutionary stability of relative 
single base-pair substitution rates (Krawczak and Cooper, 1996). 
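
The node-by-node deduction of ancestral sequences described above can be illustrated with Fitch's small-parsimony algorithm, applied independently to each alignment column. The sketch below is offered as an illustration of the parsimony principle only; it is not the procedure of Krawczak et al. (1996), and the four-species tree, the function names and the toy sequences are hypothetical. Where the algorithm returns more than one candidate base at the root, the site is ambiguous, which is the situation the authors resolved by reference to relative mutation rates.

```python
def fitch(tree, leaf_states):
    """Bottom-up pass of Fitch parsimony for a single alignment column.

    tree: a leaf name (str) or a 2-tuple of subtrees (binary tree).
    leaf_states: dict mapping leaf name to the base observed at this column.
    Returns (set of candidate bases at this node, minimum number of changes).
    """
    if isinstance(tree, str):
        return {leaf_states[tree]}, 0
    left_set, left_cost = fitch(tree[0], leaf_states)
    right_set, right_cost = fitch(tree[1], leaf_states)
    shared = left_set & right_set
    if shared:
        # The children agree on at least one base: no extra change needed.
        return shared, left_cost + right_cost
    # The children disagree: keep both options and count one substitution.
    return left_set | right_set, left_cost + right_cost + 1


def ancestral_sets(tree, alignment):
    """Apply fitch() to every column of an ungapped alignment."""
    length = len(next(iter(alignment.values())))
    return [
        fitch(tree, {name: seq[i] for name, seq in alignment.items()})
        for i in range(length)
    ]


# Hypothetical topology and toy sequences (not real coagulation factor cDNA)
tree = (("human", "macaque"), ("rat", "mouse"))
alignment = {
    "human": "ATGC",
    "macaque": "ATGC",
    "rat": "ACGT",
    "mouse": "ACGC",
}

root = ancestral_sets(tree, alignment)
for position, (bases, changes) in enumerate(root):
    print(position, sorted(bases), changes)
```

In this toy example the second column remains ambiguous at the root (C or T, one change either way), whereas the fourth column is resolved to C. A complete reconstruction would add a top-down pass to assign states to the internal nodes and a rule for choosing between equally parsimonious bases.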

The evolution of a gene family whose members acquire different functions is 
invariably accompanied by rapid amino acid sequence divergence in functionally 
important regions (Ohta, 1991, 1994). Indeed, specific examples of this phenome- 
non include the ‘accelerated evolution’ (hypervariability) of the reactive center 
regions of serine protease inhibitors (Hill and Hastie, 1987) and the active site 
regions of some serine proteases (Creighton and Darby, 1989). For humans, the 
mutation rates derived during the reconstruction process were found by 
Krawczak et al. (1996) to be significantly higher in the factor VII, factor X and 
protein C lineages than in the factor IX and prothrombin lineages. Similarly in 
the dog, the factor VII lineage exhibited a higher mutation rate than that of factor 
IX. That factor IX exhibited a lower mutation rate than the other proteins was 
indicative of its early emergence. Once adapted to its functions, factor IX would 
have had to change rather less over evolutionary time than the other proteins of 
more recent origin which still had to adapt to their new-found roles. For factor 
IX, protein C and prothrombin, mutation rate differences were also apparent 
between species and exhibited an inverse correlation with generation time. 
Consistent with previous results (Britten, 1986; Collins and Jukes, 1994), the 
mutation rate in humans was much less than that found in rodents. 


10.2.3 Evolutionary divergence of the vitamin K-dependent coagulation 
factors 


Krawczak et al. (1996) identified a number of highly conserved regions in their 
reconstructed ancestral mammalian cDNA sequences. Mismatches in these 
regions were classified on the basis of whether they corresponded to either a silent 
or a missense mutation during the process of evolutionary divergence. 
Interestingly, the numbers of silent and missense mutations did not correlate with 
each other. This finding was interpreted in terms of the existence of two distinct 
molecular clocks: one would be based upon silent mutations and would run con- 
stantly after the divergence of any two sequences. The other would be based upon 
missense mutations which would continue to run immediately after gene dupli- 
cation but would stop, or at least be dramatically slowed, once the new protein 
product had acquired, and then become adapted to, its new biological function. 
Since by far the smallest number of silent mutations in conserved codons had 
occurred since the divergence of factors VII and X, Krawczak et al. (1996) 
concluded that these two proteins must have their most recent root in common 
(Figure 10.1). This assertion is not inconsistent with the fact that the human fac- 
tor VII (F7) and factor X (F10) genes are not only syntenic but also very closely 
linked on chromosome 13q34, as a result of their recent emergence through a 
process of duplication and divergence. Since the conserved cDNA blocks of factor 
IX exhibited the largest number of neutral differences with respect to other pro- 
teins, this was held to imply that factor IX was the first to diverge from the other 
proteins. A potential pitfall with this conclusion could have been a faster 
rate of substitution at the X-linked factor IX (F9) locus as compared to the other, 
autosomal genes. However, at least for human, substitution rates in the F9 gene 
were significantly lower than in the F7, F10 and protein C (PROC) genes, 
which also appeared to be the case in dog. Thus, divergence time rather than a 
higher propensity to mutate was held to be responsible for the large number of 
silent substitutions which separate the F9 cDNA sequence from the other 
cDNAs. The precise order of divergence events for protein C, prothrombin and 
the common ancestor of factors VII and X could not be clarified unequivocally by 
Krawczak et al. (1996) on the basis of their data. 
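
The classification of mismatches into silent and missense changes, on which the two-clock argument above depends, can be sketched as a codon-by-codon comparison of two aligned, in-frame coding sequences. The following is a minimal illustration under stated assumptions, not the authors' implementation: it uses the standard genetic code, counts each differing codon once regardless of how many of its bases differ, treats a change to or from a stop codon as missense, and assumes ungapped sequences. The function name and the toy sequences are hypothetical.

```python
BASES = "TCAG"
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"  # first base T
    "LLLLPPPPHHQQRRRR"  # first base C
    "IIIMTTTTNNKKSSRR"  # first base A
    "VVVVAAAADDEEGGGG"  # first base G
)
# Standard genetic code, codons enumerated in TCAG order
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def classify_codon_differences(seq_a, seq_b):
    """Count differing codons between two aligned in-frame coding sequences.

    Returns (silent, missense): silent codon pairs encode the same amino
    acid, missense pairs encode different ones.
    """
    silent = missense = 0
    for i in range(0, min(len(seq_a), len(seq_b)) - 2, 3):
        codon_a, codon_b = seq_a[i:i + 3], seq_b[i:i + 3]
        if codon_a == codon_b:
            continue
        if CODON_TABLE[codon_a] == CODON_TABLE[codon_b]:
            silent += 1
        else:
            missense += 1
    return silent, missense


# Hypothetical toy sequences: Met-Gly-Leu-Lys versus Met-Gly-Leu-Arg
counts = classify_codon_differences("ATGGGTCTGAAA", "ATGGGCCTGAGA")
print(counts)
```

Here the GGT/GGC pair is silent (both encode Gly) and the AAA/AGA pair is missense (Lys versus Arg). Tallying the two classes separately for each pair of proteins gives the kind of counts from which the relative order of divergence was argued.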

The cDNA-based phylogeny of the five vitamin K-dependent factors presented 
in Figure 10.1 emphasizes the previously recognized relatedness of (i) protein C 
and prothrombin and (ii) factors VII and X. However, it is markedly different 
from other proposed phylogenies (Doolittle and Feng, 1987; Patthy, 1985; Patthy, 
1990; Doolittle, 1993; Figure 10.1). This may have been because earlier attempts 
employed alignments of amino acid rather than cDNA sequences. The phylogeny 
of Krawczak et al. (1996) was closest to that presented by Doolittle and Feng 
(1987): both phylogenies claimed a comparatively recent common root for the cat- 
alytic domains of factors VII and X. However, the major difference lies in the 
much earlier branching out of the catalytic domain of factor IX, postulated by 
Krawczak et al. (1996). 

Gene duplication has played a very important role in the evolution of the 
genomes of higher organisms (Ohno, 1970). Indeed, there is now considerable 
evidence for saltatory increases in the number of genes around the time of the 
emergence of the vertebrates 500 Myrs ago (Bird, 1995). These increases appear to 
have been caused by the duplication and subsequent divergence of many different 
gene sequences (see Chapter 2, section 2.1). It is still impossible to say for certain, 
however, when the duplication and divergence of the vitamin K-dependent serine 
proteases of coagulation occurred during vertebrate evolution. Prothrombin is 
present in bony fish (trout), cartilaginous fish (dogfish) and in the hagfish, one of 
the modern representatives of the jawless Agnatha (Banfield and MacGillivray, 
1992; Doolittle, 1993). Thus, although thrombin is found in the most primitive of 
modern vertebrates, there is no evidence for its existence in either protochordates 
or echinoderms. Whether the other four vitamin K-dependent factors of coagula- 
tion are present in these types of fish is as yet unclear (Doolittle, 1993). If they are, 
the adaptive radiation of the vitamin K-dependent factors of hemostasis must 
have occurred during the space of some 50 Myrs between the divergence of the 
protochordates and the appearance of the Agnatha, some 450 Myrs ago (Doolittle, 
1993). The processes of gene duplication and divergence that have led to the 
emergence of the five present day vitamin K-dependent factors of coagulation 
provide further evidence of what must have been a very active phase in the evolution 
of the modern vertebrate genome. 


10.2.4 Reconstruction of a common ancestor of the vitamin K-dependent 
coagulation factors 


The evolutionary scenario depicted in Figure 10.1 was based solely upon the 
highly conserved regions of the reconstructed ancestral mammalian cDNA 
sequences. However, this phylogeny was used by Krawczak et al. (1996) in an 
attempt to reconstruct the common ancestor of all five vitamin K-dependent fac- 
tors by means of molecular modelling. To this end, the less well conserved regions 
of the mammalian ancestral sequences were first aligned in the order stipulated by 
the phylogeny. The alignment was then used to determine the nucleotides present 
at each node of the phylogeny, moving upwards through the tree to its root. The 
resulting cDNA sequence, representing the putative common ancestor of all five 
vitamin K-dependent factors, contained several gaps which yielded ambiguities 
in the sequence of the ancestral protein. However, since the protein core had to 
contain a full complement of residues for meaningful molecular modelling to be 
possible, all such amino acids were replaced with the analogous residue found in 
extant human thrombin as a ‘best guess’. 

One way to examine the plausibility of the deduced amino acid sequence of the 
vitamin K-dependent factor ancestral protein would have been to express it in 
vitro and to characterize it biochemically (Malcolm et al., 1990; Shih et al., 1993). 
An alternative strategy was to construct a model of the tertiary structure of the 
protein by comparative methods and examine its topology and biophysical properties. 
Such a model was created by Krawczak et al. (1996) using the X-ray crystallographic 
coordinates of the heavy chain of α-thrombin as a template. This 
template was aligned against the high resolution structures of seven other serine 
proteases and the amino acid sequence of the reconstructed ancestral protein 
(Greer, 1990). Sequence modifications became necessary in this process at two 
variable regions, which were longer than the analogous loops present in extant 
serine proteases, and a single unpaired Cys residue (Cys22), which was replaced 
by Ala as in extant human thrombin. Extant human thrombin was chosen for 
comparison in this and other instances since, of all of the modern hemostatic ser- 
ine proteases, it bore the strongest sequence homology to the ancestral protein. 
Moreover, it was assumed (Doolittle, 1993) that the vitamin K-dependent factor 
ancestral protein would have been capable of performing the end effector function 
of thrombin in the hemostatic cascade, that is the cleavage of soluble fibrinogen 
to generate insoluble fibrin clot. 

The putative ancestral protein was found to contain 86 charged groups at pH 
7.0, very similar to the number (89) found in extant human thrombin. In both 
cases, the great majority of these charges were noted to be accessible to water. The 
global electrostatic distribution across the ancestral protein’s surface was rela- 
tively uniform by comparison with the dipolar distribution evident in extant 
human thrombin. Both structures still possessed an unbroken equatorial belt of 
negative charge but the electrostatic field of the ancient 
protein was weaker than that of extant thrombin. It was speculated that the 
increase in charge intensity and dispersal over evolutionary time reflected a trend 
toward increasing protein binding specificity. 

The fibrinogen-binding exosite (P1) was found to have greatly increased in size 
during the evolution of thrombin. From the electrostatic contour map, only five 
Arg (126, 165, 233) and three Lys residues (230, 243, 245) of the putative ancestral 
protein appeared to contribute to a small patch of positive charge (Figure 10.2). 
Moreover, several thrombin residues known to be important in the binding of fib- 
rinogen (Tsiang et al., 1995) were not present in the ancient protein suggesting 
that the ancestral protease bound fibrinogen only weakly (Figure 10.3). A number 
of residues in thrombin have been shown to be involved in the binding of protein 
C (Tsiang et al., 1995) and these are also components of the fibrinogen binding site 
(Lys36, Trp60D, Lys70, His71, Arg73, Tyr76, Arg77A, Lys81, Lys109, Lys110, 
Glu217, Arg221A; Figure 10.3). However, only a fraction of these residues were 
present in the ancestral protein (Trp60D, His71, Arg77A, Lys110, Glu217, 
Arg221A). This was not surprising since, at the dawn of vertebrate evolution, protein 
C had yet to evolve from the vitamin K-dependent factor ancestral protein 
(Doolittle, 1993). By contrast, both thrombin residues implicated in binding 
thrombomodulin (Gln38, Arg75; Tsiang et al., 1995), the endothelial cell surface 
thrombin receptor, were present in the ancestral protein. Thus an ancient thrombomodulin-like 
molecule might have been able to interact with the vitamin K-dependent 
factor ancestral protein. 


Figure 10.2. Stereo views of the electrostatic profiles of the putative vitamin K-dependent 
factor ancestral protein (above) and extant human thrombin (below). The view is towards 
the active site canyon. The positions of the fibrinogen-binding site (P1) and heparin-binding 
site (P2) are indicated. A large positively charged patch (P3) is also present on 
the vitamin K-dependent factor ancestral protein (after Krawczak et al., 1996). 


Figure 10.3. Schematic view of the active site canyon (bold contour) of the vitamin K-dependent 
factor ancestral protein (Krawczak et al., 1996). The active site triad (His57, 
Asp102, Ser195, chymotrypsin numbering) is denoted by a triangle. 
H: heparin-binding site, G: glycosylation site, C: chemotactic region, R: RGD sequence, 
A: aryl-binding site, F: fibrinogen-binding exosite, S: specificity sites N-terminal to 
cleavage, S’: specificity sites C-terminal to cleavage. Residues that are conserved between 
the vitamin K-dependent factor ancestral protein and extant human thrombin are 
circled. Those residues that have not been conserved are circled with a broken line. 

When the residues responsible for binding thrombomodulin and protein C 
were excluded from the fibrinogen binding patch, five of the remaining seven 
residues were found to be conserved between the vitamin K-dependent factor 
ancestral protein and extant human thrombin (Lys60F, Asn60G, Asp186A, 
Lys186D, Glu192) (Figure 10.3). Both of the altered residues (Thr60I, Asp222) 
exhibit conservative changes (to Ser and Glu, respectively). These seven residues 
may have constituted the original fibrinogen binding patch that was to increase in 
size and binding affinity as well as diversifying functionally over evolutionary 
time. 

The extensive positive electrostatic charge associated with the heparin binding 
site in extant thrombin was much smaller in size and field strength in the model 
of the putative ancient protein (Figure 10.2). In prothrombin, the second kringle 
domain interacts with the heparin-binding site to slow down antithrombin 
III/heparin-mediated inhibition (Arni et al., 1993) prior to proteolytic prothrombin 
activation. The primitive heparin binding site evident in the ancient protease 
would have been unable to bind the kringle 2 domain as strongly as its extant 
descendant. This is consistent with the view that the ancient protein contained a 
light chain of Gla and EGF domains, the latter only being replaced by kringles at 
a later stage in the evolution of the protease (Patthy, 1985). The heparin-binding 
site may then have evolved in such a way as to balance the dual requirements of 
antithrombin III/heparin-mediated inhibition of thrombin and the kringle 2- 
mediated protection of prothrombin from premature inhibition. 


10.2.5 The evolution of human protein C 


One of the best characterized vitamin K-dependent serine proteases of coagula- 
tion is protein C. Activated protein C (APC) exerts a negative feedback regulatory 
effect on both the intrinsic and extrinsic pathways of coagulation through the 
proteolytic inactivation of factors Va and VIIIa in the presence of protein S and a 
negatively charged phospholipid surface (Tuddenham and Cooper, 1994). Protein 
C is activated by thrombin through cleavage of the Arg169-Leu170 bond with 
release of a dodecapeptide from the heavy chain. This reaction is enhanced some 
20 000-fold by thrombomodulin, an endothelial cell surface glycoprotein which 
binds thrombin with high affinity (reviewed by Esmon, 1995). 

The substrate specificity of serine proteases is largely dependent upon the 
structure and properties of the substrate-binding pocket adjacent to the active 
site. By analogy with other trypsin-like serine proteases, Asp189 (chymotrypsin 
numbering) of APC forms the bottom of this binding pocket (Segal et al., 1971; 
Shekter and Berger, 1967). In principle, the presence of Ser198 at the S2 substrate- 
binding site of APC would allow the binding pocket to accommodate larger 
amino acid residues such as Phe or Leu (Stone and Hofsteenge, 1985). However, 
this theoretical diversity appears not to be capitalised upon by the native structure 
since the substrate specificity of APC is in reality confined to factors Va and VIIIa 
(Esmon, 1987). 

A molecular model of the catalytic domain of early mammalian protein C was 
also derived and was used to examine the functional architecture of that ancient 
domain and to explore its evolutionary progression from the putative common 
ancestor of the vitamin K-dependent serine proteases toward extant human pro- 
tein C (Wacey et al., 1997). This application of homology modeling to a recon- 
structed amino acid sequence made it possible to trace the evolution of structural 
features in a human protein and, in so doing, to make certain inferences regard- 
ing the development of its functional specificity. Since its first appearance in evo- 
lution as a result of a gene duplication, the catalytic domain of human protein C 
has undergone 41 amino acid changes, at least some of which have presumably 
served to ‘fine-tune’ interactions with its substrates. Thus, those sites of protein-protein 
interaction already present in early mammalian protein C should contain 
the amino acid residues most essential for interaction and hence for the biological 
functions of the protein. 


10.2.6 Comparison of the functional architecture of early mammalian 
protein C with the putative ancestor of all vitamin K-dependent 
factors 


Six of the 13 variable regions (VRs) in the model of early mammalian protein C 
were derived by reference to the crystal structure of factor Xa. The similarity in 
VR loop length between early mammalian protein C and extant factor X is con- 
sistent with both proteins sharing a considerable period of evolutionary history. It 
would seem reasonable to suppose that factor X (an activator of prothrombin and 
therefore a procoagulant) and protein C (an anticoagulant by virtue of its 
inactivation of factors Va and VIIIa) have coevolved in such a 
way as to ensure that thrombin generation is appropriately promoted yet ade- 
quately limited in response to hemostatic challenge. 

Doolittle (1993), Patthy (1990), and Krawczak et al. (1996) all concluded that 
protein C and prothrombin probably emerged at a similar stage of vertebrate evo- 
lution. In the primordial vertebrate hemostatic system, one role for protein C 
could have been to act as part of a negative feedback control mechanism. Since 
both the catalytic efficiency of thrombin and the efficiency of activation of pro- 
thrombin were presumably undergoing a continual process of optimization at 
that time, there would have been an increasing requirement for an efficient nega- 
tive regulator of thrombin. 

Any change of function from the common ancestor of the vitamin K-dependent 
factors to early mammalian protein C would have necessarily required changes in 
the active site of the protease, thereby altering its substrate specificity from fib- 
rinogen to the emerging co-factors VIIIa and Va. The ‘accelerated evolution’ of 
active sites is thought to have been important in the diversification of the sub- 
strate specificity of serine proteases, subsequent to gene duplication (Creighton 
and Derby, 1989; Ohta, 1994). From inspection of our models, changes in the 
active site region of protein C appear to have involved the elimination of the aryl 
binding site (with the exception of residue W215 which is highly conserved 
between all serine proteases) and the chemotactic binding site (Figure 10.4). Both 
of these modifications resulted from the loss of two pincer-like highly mobile 
insertion loops, Leu59-Asn62 and (to a lesser extent) Leu144-Gly150, originally 
present in the vitamin K-dependent factor ancestral protein. Since both loops are 
required for the catalytic activity of extant thrombin (Stubbs and Bode, 1993), it 
may be inferred that the early mammalian ancestor of protein C had already lost 
its ability to cleave fibrinogen. Moreover, the absence of those residues which in 
thrombin bind protein C [with the exception of Leu73, Arg75, and Asp186A 
(Tsiang et al., 1995) and see below] implies that early mammalian protein C was 
probably not auto-catalytic. 

In the early mammalian protein C molecule, the distribution of residues analo- 
gous to the fibrinogen-binding site of extant thrombin was very different from 
that predicted in the putative vitamin K-dependent factor ancestral protein 
(Figure 10.4). In terms of evolutionary conservation, the fibrinogen-binding patch 
of early mammalian protein C was split along its equatorial axis. The ‘North’ 
patch had acquired three non-conservative mutations, all of which were to neutral 
Ile and Leu residues; the resulting changes in both polarity and geometry 
probably reflected adaptation to new substrates, viz. factors VIIIa and Va. By 
contrast, the ‘South’ patch had not experienced any non-conservative substitutions. 
With the exception of Leu73, Arg75, and Asp186A, residues analogous to the 
fibrinogen-binding patch of extant thrombin (Stubbs and Bode, 1993) which are 
located outside of the active site were absent from early mammalian protein C. It 
may be, therefore, that early mammalian protein C was unable to bind fibrinogen. 

Figure 10.4. Schematic view of the active site canyon (bold contour) of early mammalian 
protein C (Wacey et al., 1997). The location of the active site triad residues (His57, 
Asp102, Ser195, chymotrypsin numbering) is denoted by a shaded triangle. Residues which 
are conserved between the vitamin K-dependent factor ancestral protein and early 
mammalian protein C are circled. Nonconserved residues are circled with a broken line. H: 
heparin-binding site, G: glycosylation site, C: chemotactic region, A: aryl-binding site, F: 
fibrinogen-binding exosite, S’: specificity sites C-terminal to cleavage. 

An electrostatic view of the early mammalian protein C molecule (Figure 10.5, 
bottom right) reveals a positively charged patch (P4) in the ‘South-West’ corner of 
the molecule. This patch appears to have become moderately expanded in extant human 
protein C and comprises (i) residues analogous to the fibrinogen-binding residues 
of the ‘South’ active site, (ii) Lys38, a fibrinogen-binding residue external to the 
active site, and (iii) the thrombomodulin-binding residues Lys36 and Arg75 of 
extant thrombin. P4 also contained residues analogous to Asn36, Gln37, Glu38 
(chymotrypsin numbering) of the vitamin K-dependent factor ancestral protein 
(Glu38 belongs to the S3’ site of extant thrombin). In early mammalian protein C, 
these three residues had become lysines which would have served to increase the 
net positive charge in this area. Since this region binds thrombomodulin in 
extant protein C (Wacey et al., 1993; Greengard et al., 1994; Vincenot et al., 1995; 
Grinnell et al., 1994), the P4 site may represent the nascent thrombomodulin- 
binding patch in early mammalian protein C. Figure 10.5 depicts the envisaged 
evolutionary transformation of the primitive original fibrinogen/thrombomodulin-binding 
patch into a region specialised for thrombomodulin binding. In 
extant protein C, substitution of Trp76 by Arg has served to increase further the 
positive potential, and thus specificity, of this patch. Finally, it should be noted 
that the basic residues of P4 (K36, K37, K38, R74, K148, R149, R151) in early 
mammalian protein C appear to have undergone relatively few subsequent substi- 
tutions in order to ‘fine-tune’ binding to thrombomodulin. Thus the P4 patch in 
early mammalian protein C was probably of the minimum size required to bind 
thrombomodulin, and was subsequently extended and refined by evolution. 
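
The charge character of such a patch can be sanity-checked with a very crude calculation: summing formal side-chain charges over the residues listed in the text. The sketch below assumes roughly neutral pH, treats histidine and chain termini as uncharged, and ignores local environment entirely; the function and variable names are illustrative only, and the result is no substitute for the electrostatic potential calculations of Wacey et al. (1997).

```python
# Formal side-chain charges at roughly neutral pH (simplifying assumption:
# histidine and the chain termini are treated as uncharged).
SIDE_CHAIN_CHARGE = {"K": 1, "R": 1, "D": -1, "E": -1}


def net_formal_charge(residues):
    """Sum formal charges for labels such as 'K36' (one-letter code + position)."""
    return sum(SIDE_CHAIN_CHARGE.get(label[0].upper(), 0) for label in residues)


# Residues of P4 in early mammalian protein C, as listed in the text.
p4_early_mammalian = ["K36", "K37", "K38", "R74", "K148", "R149", "R151"]

# Positions 36-38: Asn, Gln, Glu in the ancestral protein; lysines later.
ancestral_36_38 = ["N36", "Q37", "E38"]
early_mammalian_36_38 = ["K36", "K37", "K38"]

print(net_formal_charge(p4_early_mammalian))      # 7
print(net_formal_charge(ancestral_36_38))         # -1
print(net_formal_charge(early_mammalian_36_38))   # 3
```

On this simple count, replacing Asn36, Gln37 and Glu38 by lysines shifts the formal charge of the three positions from -1 to +3, in line with the increase in net positive charge described above.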

The binding patch P3 (comprising basic residues Lys20, Arg23, Lys159, and 
Arg188) of the vitamin K-dependent factor ancestral protein is also apparent in 
early mammalian protein C. However, in the latter molecule, this patch was less 
expansive and confined to a higher latitude in the molecule. The apparent 
migration of the patch away from the equatorial belt appears to have continued during 
evolution toward extant human protein C. The persistence of the positive 
potential of this area through evolutionary time, together with the report of a type II 
(dysfunctional) substitution (R352W, Reitsma et al., 1995) in this region, implies 
a functional role in both ancient and extant proteins. 

Figure 10.5. The electrostatic profiles of the vitamin K-dependent factor ancestral 
protein (top), early mammalian protein C (bottom right) and extant human protein C 
(bottom left) (Wacey et al., 1997). In each panel the alpha carbon backbone is shown 
as a ribbon and the electrostatic surface is solid. The electrostatic equipotential 
surfaces are contoured at +1 kcal mol⁻¹. The view is towards the active site canyon. 
The locations of the positively charged patches P3 and P4 are indicated. 


10.2.7 Comparative geometry of the active sites of early mammalian and 
extant protein C 


Extant human protein C and early mammalian protein C exhibit 83.6% identity 
at the amino acid sequence level and a backbone RMS divergence 
of 2.6 Å (computed against a model of the serine protease domain of extant 
human protein C; Wacey et al., 1993). Recently, a relatively low resolution crystal 
structure of human APC (des Gla domain) has been solved (Mather et al., 1996). 
Comparison of our model with this structure served to demonstrate that the 
majority of functionally important residues identified in extant protein C were 
already present in its early mammalian predecessor. The Ca²⁺ binding loop 
(residues 70-80 with the exception of N78), the active site loop (residues 146-152) 
and the insertion helix (residue 129) are all to be found intact in early mammalian 
protein C. The asymmetric surface charge distribution of extant protein C is also 
apparent in early mammalian protein C. Finally, residues E192 and Y143 (the 
major determinants of substrate specificity) which span the active site, are present 
in both proteins. 
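
The two summary statistics quoted above, percentage sequence identity and backbone RMS divergence, can be illustrated with a minimal Python sketch. It rests on several assumptions: the sequences are pre-aligned and gap-free, the coordinate sets are already optimally superimposed and listed in matching atom order, and the function names are hypothetical. The alignment length of 250 used in the example is not stated in the text; it is simply the length at which 41 differences would give 83.6% identity.

```python
import math


def percent_identity(seq_a: str, seq_b: str) -> float:
    """Percentage of identical positions in two pre-aligned, gap-free sequences."""
    if len(seq_a) != len(seq_b) or not seq_a:
        raise ValueError("sequences must be non-empty and of equal length")
    matches = sum(a == b for a, b in zip(seq_a, seq_b))
    return 100.0 * matches / len(seq_a)


def rms_deviation(coords_a, coords_b) -> float:
    """Root-mean-square distance between paired (x, y, z) points.

    Assumes the two structures are already superimposed; no fitting is done.
    """
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("coordinate lists must be non-empty and of equal length")
    total = sum(
        (xa - xb) ** 2 + (ya - yb) ** 2 + (za - zb) ** 2
        for (xa, ya, za), (xb, yb, zb) in zip(coords_a, coords_b)
    )
    return math.sqrt(total / len(coords_a))


# Illustrative only: 41 mismatches over an ASSUMED 250 aligned positions.
toy_a = "A" * 250
toy_b = "A" * 209 + "C" * 41
print(round(percent_identity(toy_a, toy_b), 1))   # 83.6

# Two made-up backbone atoms displaced by 2.6 units along z.
print(round(rms_deviation([(0, 0, 0), (1, 0, 0)],
                          [(0, 0, 2.6), (1, 0, 2.6)]), 2))   # 2.6
```

In practice the superposition step matters: an RMS figure is only meaningful once the two backbones have been fitted onto one another, which this sketch deliberately leaves out.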

Since the regions close to the catalytic triad of extant human and early mam- 
malian protein C are likely to possess a similar functional architecture (owing to 
the minimal divergence in the template structures underlying the two models), the 
active site geometries of the two proteins may be reliably contrasted at relatively 
high resolution. Within the catalytic site pocket of extant human protein C, the S2 
residue (Ser 198) is predicted to have arisen by substitution of a Phe residue pre- 
sent in early mammalian protein C. This would imply that an initially tightly fit- 
ting catalytic pocket became more capacious during evolution. Why, however, the 
substrate specificity of extant human protein C is restricted to factors Va and VIIIa, 
although its active site pocket can theoretically accommodate larger substrates, 
remains unclear. With this exception, the amino acid sequence of the active site 
pocket of early mammalian protein C is identical to that of extant human protein 
C. Thus the structural features noted in the crystal structure of human APC would 
probably also have been found in the early mammalian protein. 

A prominent hydrophobic and solvent-accessible region is found on the surface 
of the serine protease domain of early mammalian protein C. This region has been 
shown to bind the second (C-terminal) EGF-like domain of the light chain in fac- 
tors Xa, IXa, and protein C (residues 100-116 and 45-50 respectively). Early 
mammalian protein C may therefore have possessed a light chain, a hypothesis 
consistent with Patthy’s (1985) view of serine protease evolution. 

In summary, the application of homology modeling to a reconstructed amino 
acid sequence allowed Wacey et al. (1997) to trace the evolution of specific 
structural features in protein C. This approach provided new insights into the 
evolution of protein C, allowed an assessment of the nature of the minimal 
thrombomodulin binding site and permitted inferences to be made as to the possible 
catalytic mechanism in early mammalian protein C. Such an approach lends itself 
readily to the study of the common ancestors of other multigene families and may 
therefore allow further ‘structure-function studies’ of phylogenetically 
reconstructed proteins to be performed. 

Index to Human Gene Symbols used in text 


(arranged alphabetically by gene symbol) 


A1BG - Glycoprotein, α1-B, 60 

A2M - Macroglobulin, α2-, 59, 154, 267 

AAC1 - Arylamine N-acetyltransferase, 107 

AACT - Antichymotrypsin, α1-, 192 

AARS — Aminoacyl tRNA synthetase (alanyl), 
178 

ABL2 — Abl proto-oncogene homolog 2, 59 

ABLL — Abl proto-oncogene-like, 62 

ABO — ABO blood group, 13, 87, 335, 402 

ACADM - Acyl-coenzyme A dehydrogenase, 
medium chain, 60 

ACADS - Acyl-coenzyme A dehydrogenase, 
short chain, 60 

ACHE - Acetylcholinesterase, 398 

ACO1 - Aconitase 1, 15 

ACO2 - Aconitase 2, 15 

ACP2 - Acid phosphatase 2, lysosomal, 60 

ACP5 - Acid phosphatase 5, tartrate-resistant, 
60 

ACTA1 - Actin, α-, skeletal muscle, 15, 61, 151, 
152 

ACTA2 - Actin, α-, smooth muscle, aorta, 15, 
151, 152 

ACTB - Actin, β-, non-muscle cytoplasmic, 
151, 152, 225, 271 

ACTC - Actin, α-, cardiac, 15, 61, 151, 152, 223 

ACTG1 - Actin, γ-, cytoplasmic 2, 110, 151, 
152, 271 

ACTG2 - Actin, γ-, smooth muscle enteric, 15, 
151, 152 

ADA - Adenosine deaminase, 18, 29 

ADH1 - Alcohol dehydrogenase 1, 16, 17, 141 

ADH2 - Alcohol dehydrogenase 2, 16, 17, 141, 
234 

ADH3 - Alcohol dehydrogenase 3, 16, 17, 141 

ADH4 - Alcohol dehydrogenase 4, 16 

ADH5 - Alcohol dehydrogenase 5, 16 

ADH7 - Alcohol dehydrogenase 7, 16 

ADPRT — ADP-ribosyltransferase, 270 

ADRA1B - Adrenergic receptor, α1B, 58, 346 

ADRA2C - Adrenergic receptor, α2C, 58 

ADRB2 - Adrenergic receptor, β2, 58, 107, 
346 

AFM - Albumin, α-/afamin, 152, 247, 348 

AFP - Fetoprotein, α-, 152, 153, 247, 348 

AGA - Aspartylglucosaminidase, 58 

AGT — Angiotensinogen, 61, 192 

AGTR1 - Angiotensin receptor, 184, 271, 274 


AGXT - Peroxisomal L-alanine: glyoxylate 
aminotransferase, 358 

AHSG - Glycoprotein, α2-HS-, 184 

AK2 - Adenylate kinase 2, 62 

ALAS1 - Aminolevulinate synthase, δ-, 1, 21 

ALAS2 - Aminolevulinate synthase, δ-, 2, 66, 
78, 110 

ALB — Albumin, 152, 153, 247, 348 

ALD - Adrenoleukodystrophy protein, 145, 
268, 395, 396 

ALDH1 - Aldehyde dehydrogenase 1, 118, 148 

ALDH10 - Aldehyde dehydrogenase 10, 118, 
148 

ALDH2 - Aldehyde dehydrogenase 2, 88, 118 

ALDH3 - Aldehyde dehydrogenase 3, 118, 148 

ALDH5 - Aldehyde dehydrogenase 5, 118, 148 

ALDH6 - Aldehyde dehydrogenase 6, 118, 148 

ALDH9 - Aldehyde dehydrogenase 9, 118, 148 

ALDOA - Aldolase A, 234, 235 

ALPI - Alkaline phosphatase, intestinal, 228, 
347 

ALPP — Alkaline phosphatase, placental, 228, 
347 

ALPPL2 — Alkaline phosphatase, placental-like 
2, 347 

AMD1 - Adenosylmethionine, S-, 
decarboxylase, 273 

AMELX — Amelogenin, X-linked, 66, 78, 80, 
334 

AMELY — Amelogenin, Y-linked, 78, 79, 80, 
334 

AMH - Anti-Müllerian hormone, 61 

AMY1A - Amylase, salivary 1A, 14, 239, 242, 
243, 331, 390, 404, 408 

AMY1B - Amylase, salivary 1B, 239, 242, 243, 
331, 390, 404, 408 

AMY1C - Amylase, salivary 1C, 34, 239, 242, 
243, 338, 404, 408 

AMY2A - Amylase, pancreatic 2A, 14, 242, 243, 
404, 408 

AMY2B - Amylase, pancreatic 2B, 14, 242, 243, 
404, 408 

ANPEP - Alanyl aminopeptidase, 60 

ANT3 - Adenine nucleotide translocator 3, 80, 
394 

ANX1 - Annexin I, 114, 225, 348 

ANX11 - Annexin XI, 348 

ANX13 - Annexin XIII, 348 

ANX2 - Annexin II, 114, 348, 353 

ANX3 — Annexin III, 58, 348 

ANX4 — Annexin IV, 348 

ANX5 - Annexin V, 58, 348 

ANX6 — Annexin VI, 58, 114, 225, 348 

ANX7 — Annexin VII, 225, 348 

ANX8 — Annexin VIII, 348 

APOA1 - Apolipoprotein A1, 60, 152, 154, 223, 
224, 228, 229, 236 

APOA2 — Apolipoprotein A2, 60, 152, 154 

APOA4 — Apolipoprotein A4, 60, 152, 154 

APOB - Apolipoprotein B, 8, 110, 116, 153, 
225, 246 

APOBEC1 - Apolipoprotein B mRNA-editing 
enzyme, 241, 246 

APOC1 - Apolipoprotein C1, 60, 152, 154, 237, 
238, 266, 300, 350 

APOC2 — Apolipoprotein C2, 60, 152, 154, 237 

APOC3 - Apolipoprotein C3, 60, 152, 154, 225 

APOC4 - Apolipoprotein C4, 152, 153, 237, 238 

APOD - Apolipoprotein D, 153 

APOE - Apolipoprotein E, 16, 28, 60, 152, 154, 
225, 237, 238 

AR — Androgen receptor, 66, 189, 362, 364, 365 

ARAF1 - Raf proto-oncogene homolog 1, 78 

ARCN1 - Archain, homolog of Drosophila gene, 
144 

AREG - Amphiregulin, 58 

ARG1 - Arginase, liver, 15, 231, 335 

ARSA - Arylsulfatase A, 176 

ARSB - Arylsulfatase B, 176 

ARSD - Arylsulfatase D, 80, 176 

ARSE - Arylsulfatase E, 80, 176 

ARSF - Arylsulfatase F, 176 

ART2P — ADP-ribosyltransferase 1 (inactive), 
282 

ASL — Argininosuccinate lyase, 15 

ASS — Argininosuccinate synthetase, 15, 17, 271 

AT3 — Antithrombin III, 60, 192, 236, 237 

ATOH1 - Atonal, homolog of Drosophila gene, 
144 

ATP1A1 - ATPase, Na⁺/K⁺, α1 polypeptide, 
60, 62 

ATP1A2 - ATPase, Na⁺/K⁺, α2 polypeptide, 
60, 62 

ATP1A3 - ATPase, Na⁺/K⁺ transporting, α 
polypeptide, 60 

ATP1AL2 - ATPase, Na⁺/K⁺ transporting, α 
polypeptide-like 2, 60 

ATP1B1 - ATPase, Na⁺/K⁺, β1 polypeptide, 
60, 62, 363 

ATP2A2 - ATPase, Ca²⁺ transporting, cardiac 
muscle, slow twitch 2, 60 

ATP2B1 - ATPase, Ca²⁺ transporting, plasma 
membrane 1, 60 

ATP2B2 - ATPase, Ca²⁺ transporting, plasma 
membrane 2, 60 

ATP7A - ATPase, copper transporting, α- 
polypeptide, 274 

AXL — Axl proto-oncogene, 124 

B2M - Microglobulin, β2-, 124, 172 

BCAT1 - Branched chain aminotransferase 1, 
59 


BCAT2 - Branched chain aminotransferase 2, 
59 

BCP - Blue cone pigment, 316, 334 

BCR - Breakpoint cluster region, 363 

BDNF - Brain-derived neurotrophic factor, 
198 

BF — Properdin (B factor), 124, 171 

BFSP2 - Beaded filament structural protein 2, 
phakinin, 113 

BGLAP - Osteocalcin, 20 

BGP - Biliary glycoprotein, 343, 344 

BLM - DNA helicase (RecQ), Bloom 
syndrome-associated, 148 

BPY1 - Basic protein Y1, 79 

BPY2 - Basic protein Y2, 79 

BRAF - Raf proto-oncogene B1 homolog, 272 

BRCA1 - Breast cancer susceptibility, early 
onset, 140, 238, 239, 240, 274 

BTK - Bruton’s tyrosine kinase, 110, 111 

C1NH - C1 inhibitor, 61 

C1QA - Complement component 1Q α-chain, 
15, 60, 62, 153 

C1QB - Complement component 1Q β-chain, 
15, 60, 62, 153 

C1QG - Complement component 1Q γ-chain, 
153 

C1R - Complement component C1R, 59, 153 

C1S - Complement component C1S, 59, 153 

C2 - Complement component C2, 153, 171, 346 

C3 —- Complement component C3, 59, 154 

C4A — Complement component C4A, 13, 153, 
171, 331, 346, 350, 351 

C4B — Complement component C4B, 13, 153, 
171, 331, 346, 350, 351 

C4BPA - C4b-binding protein α-chain, 15, 62, 
124 

C4BPB - C4b-binding protein β-chain, 15, 62, 
280 

C5 — Complement component C5, 153, 154, 
343, 346 

C6 —- Complement component 6, 109, 124, 153 

C7 —Complement component 7, 58, 109, 153 

C8A — Complement component C8A, 60, 62, 
153 

C8B — Complement component C8B, 60, 62, 
153 

C8G — Complement component C8G, 153 

C9 — Complement component 9, 59, 109, 153 

CA5 - Carbonic anhydrase V, 267 

CACNA1A - Calcium channel, voltage- 
dependent, P/Q type, α1A subunit, 362 

CACNA1C - Calcium channel, voltage- 
dependent, L-type, alpha 1C subunit, 59 

CAGR1 - Cell fate determining gene, homolog 
of Caenorhabditis gene, 144 

CALB3 - Calbindin-D9K, 229 

CALCA - Calcitonin/calcitonin gene-related 
peptide α, 25, 59, 267 

CALCB - Calcitonin-related polypeptide beta, 
59 

CALM1 - Calmodulin 1, 271 

CALML1 - Calmodulin-like gene 1, 108, 338 

CALML3 - Calmodulin-like gene 3, 108 

CAPG - Capping protein Cap G, 109 

CAPN1 - Calpain 1, large subunit, 61 

CAPN2 - Calpain 2, large subunit, 61 

CAPN3 - Calpain 3, 61 

CAPN4 - Calpain 4, small subunit, 61 

CARS - Aminoacyl-tRNA synthetase 
(cysteinyl), 178 

CBFA2T1 - MTG8/ETO proto-oncogene, 112 

CBG - Corticosteroid-binding globulin, 192 

CBS - Cystathionine-β-synthase, 148 

CCR2 - Chemokine receptor 2, 86 

CCR5 - Chemokine receptor 5, 87 

CD1A - CD1A antigen, 60 

CD1B - CD1B antigen, 60 

CD1C - CD1C antigen, 60, 397 

CD1D - CD1D antigen, 60 

CD1E - CD1E antigen, 60 

CD36 — CD36 antigen, 347 

CD36L1 — CD36 antigen-like protein 1, 347 

CD36L2 - CD36 antigen-like protein 2, 347 

CD3D - CD3D antigen, 60 

CD3E - CD3E antigen, 60 

CD3G - CD3G antigen, 60 

CD3Z - CD3Z antigen, zeta polypeptide, 9, 60 

CD4 - CD4 antigen, 20, 60, 85 

CD48 — CD48 antigen, 60 

CD58 — CD58 antigen, 62 

CD68 - CD68 antigen, 231 

CD8A - CD8A antigen, 239, 240 

CDC4L - Cell division cycle 4-like, 244 

CDH1 - Cadherin 1, 112, 183 

CDH11 - Cadherin 11, 183 

CDH12 - Cadherin 12, 183, 275 

CDH13 - Cadherin 13, 183 

CDH15 - Cadherin 15, 183 

CDH2 - Cadherin 2, 183, 363 

CDH3 - Cadherin 3, 183 

CDH5 - Cadherin 5, 183 

CDKN2A - Cyclin-dependent kinase inhibitor 
(p16), 28, 408 

CDKN2B - Cyclin-dependent kinase inhibitor 
(p15), 408 

CDY - Chromodomain Y, 79 

CEA - Carcinoembryonic antigen, 59, 300, 347 

CEAL1 - Carcinoembryonic antigen-like 1, 59 

CENPB - Centromeric protein CENP-B, 338 

CETP - Cholesterol ester transfer protein, 346 

CFTR - Cystic fibrosis transmembrane 
conductance regulator, 86, 123, 223, 235, 
269, 352 

CGA - Chorionic gonadotropin, α chain, 210 

CGB - Chorionic gonadotropin, β chain, 60, 
227, 335, 336 

CHC1 - Mitotic regulator, CHC1, 9, 112, 353, 
354 

CHR39B — Cholesterol-repressible protein 39B, 
61 

CHRM1 - Cholinergic receptor, muscarinic 1, 
59 

CHRM3 - Cholinergic receptor, muscarinic 3, 
59 

CHRM4 - Cholinergic receptor, muscarinic 4, 
59 

CHRM5 - Cholinergic receptor, muscarinic 5, 
59 

CKM - Creatine kinase, muscle, 61 

CKMT1 - Creatine kinase, mitochondrial 1, 61 

CLCN1 - Chloride channel 1, 114 

CLCN5 - Chloride channel 5, 114 

CLCN6 - Chloride channel 6, 114 

CLCN7 — Chloride channel 7, 114 

CMA1 - Mast cell chymase, 191 

CMAH - Cytidine monophosphate-N- 
acetylneuraminic acid hydroxylase, 283 

COL10A1 - Collagen, α1(X), 110, 157 

COL11A1 - Collagen, α1(XI), 60, 157 

COL11A2 - Collagen, α2(XI), 156, 157, 346 

COL12A1 - Collagen, α1(XII), 157 

COL13A1 - Collagen, α1(XIII), 157 

COL14A1 - Collagen, α1(XIV), 157 

COL15A1 - Collagen, α1(XV), 157 

COL16A1 - Collagen, α1(XVI), 157 

COL17A1 - Collagen, α1(XVII), 157 

COL18A1 - Collagen, α1(XVIII), 157 

COL19A1 - Collagen, α1(XIX), 157 

COL1A1 - Collagen, α1(I), 110, 157 

COL1A2 - Collagen, α2(I), 157, 158 

COL2A1 - Collagen, α1(II), 60, 108, 109, 157 

COL3A1 - Collagen, α1(III), 156, 157 

COL4A1 - Collagen, α1(IV), 156, 157, 231 

COL4A2 - Collagen, α2(IV), 20, 156, 157, 231 

COL4A3 - Collagen, α3(IV), 156, 157, 231 

COL4A4 - Collagen, α4(IV), 156, 157, 231 

COL4A5 - Collagen, α5(IV), 66, 156, 157, 231 

COL4A6 - Collagen, α6(IV), 156, 157, 231 

COL5A1 - Collagen, α1(V), 8, 108, 156, 157, 
346 

COL5A2 - Collagen, α2(V), 156, 157 

COL6A1 - Collagen, α1(VI), 157 

COL6A2 - Collagen, α2(VI), 157 

COL6A3 - Collagen, α3(VI), 157 

COL7A1 - Collagen, α1(VII), 8, 157 

COL8A1 - Collagen, α1(VIII), 157 

COL8A2 - Collagen, α2(VIII), 60, 157 

COL9A1 - Collagen, α1(IX), 157 

COL9A2 - Collagen, α2(IX), 157 

COL9A3 - Collagen, α3(IX), 157 

COX10 — Cytochrome c oxidase subunit X, 9 

COX4 — Cytochrome c oxidase subunit IV, 84, 
302 

COX5B — Cytochrome c oxidase subunit V b, 
223 

COX6B — Cytochrome c oxidase subunit VI b, 
271 

CP - Ceruloplasmin, 148 

CPS1 - Carbamoyl phosphate synthetase 1, 15, 
118, 148 

CR1 - Complement component receptor 1, 62, 
355, 357 

CR2 - Complement component receptor 2, 62, 
130 

CRABP1 - Cellular retinoic acid-binding 
protein, 60 

CRFB4 - Cytokine receptor B4, 348 

CRYAA - Crystallin, αA-, 130, 155 

CRYAB - Crystallin, αB-, 155, 185 

CRYBA1 - Crystallin, βA1-, 155 

CRYBA2 - Crystallin, βA2-, 155 

CRYBA4 - Crystallin, βA4-, 155 

CRYBB1 - Crystallin, βB1-, 155 

CRYBB2 - Crystallin, βB2-, 155 

CRYBB3 - Crystallin, βB3-, 155 

CRYGA - Crystallin, γA-, 155 

CRYGB - Crystallin, γB-, 155, 228 

CRYGC - Crystallin, γC-, 155, 228 

CRYGD - Crystallin, γD-, 155 

CRYGS - Crystallin, γS-, 155 

CSF1 - Colony stimulating factor 1, 58 

CSF1R - Colony-stimulating factor-1 receptor 
(Fms proto-oncogene), 16, 58, 270, 274 

CSF2 - Macrophage colony-stimulating factor, 
16, 58 

CSF2RA - Granulocyte macrophage colony- 
stimulating factor receptor α-chain, 16, 78, 
80, 81, 109 

CSH1 - Chorionic somatomammotropin 1, 141, 
162, 164, 165, 227, 268, 278, 404, 408, 410 

CSH2 - Chorionic somatomammotropin 2, 141, 
162, 164, 165, 227, 268, 278, 404, 408, 410 

CSHL1 - Chorionic somatomammotropin-like, 
115, 278, 320, 410 

CSN2 - Casein, β-, 124, 390 

CSNK2A1 - Casein kinase 2-α1 subunit, 108 

CSRP2 — Cysteine and glycine-rich protein 2, 
272 

CST1 - Cystatin SN, 184 

CST2 — Cystatin SA, 184 

CST3 — Cystatin C, 184 

CST4 — Cystatin S, 184 

CSTA — Cystatin A, 184 

CSTB — Cystatin B, 184, 362 

CTNNA1 - Catenin, αE-, 272 

CTRB1 - Chymotrypsin, 191, 197 

CTRL - Chymotrypsin-like protease, 197 

CTSD - Cathepsin D, 61 

CTSE - Cathepsin E, 61 

CTSH - Cathepsin H, 61 

CUTL - Cut, homolog of Drosophila gene, 144 

CYB5 - Cytochrome b5, 271 

CYBB - Cytochrome b-245, 66 

CYC1 - Cytochrome c, 145 

CYP11A - Cytochrome P450, 11A, 183 

CYP11B1 - Cytochrome P450, 11B1, 183 

CYP11B2 - Cytochrome P450, 11B2, 183 

CYP17 - Cytochrome P450, 17, 183, 184 

CYP19 - Cytochrome P450, 19, 183 

CYP1A1 - Cytochrome P450, 1A1, 183 

CYP1A2 - Cytochrome P450, 1A2, 183 

CYP1B1 - Cytochrome P450, 1B1, 183 

CYP21 — Cytochrome P450, 21, 9, 18, 70, 183, 
184, 265, 266, 277, 350, 351, 411 

CYP24 — Cytochrome P450, 24, 183, 185 

CYP26A1 — Cytochrome P450, 26A1, 183 

CYP27A1 - Cytochrome P450, 27A1, 183 

CYP27B1 — Cytochrome P450, 27B1, 183 

CYP2A13 — Cytochrome P450, 2A13, 183 

CYP2A6 — Cytochrome P450, 2A6, 183, 408 

CYP2A7 — Cytochrome P450, 2A7, 183, 408 

CYP2B6 — Cytochrome P450, 2B6, 183 


CYP2B7 — Cytochrome P450, 2B7, 183 

CYP2C10 — Cytochrome P450, 2C10, 183 

CYP2C18 — Cytochrome P450, 2C18, 183 

CYP2C19 — Cytochrome P450, 2C19, 183 

CYP2C8 — Cytochrome P450, 2C8, 183 

CYP2C9 — Cytochrome P450, 2C9, 183 

CYP2D6 — Cytochrome P450, 2D6, 183, 335 

CYP2E - Cytochrome P450, 2E, 183 

CYP2F1 - Cytochrome P450, 2F1, 183 

CYP2J2 - Cytochrome P450, 2J2, 183 

CYP3A4 - Cytochrome P450, 3A4, 183 

CYP4A11 - Cytochrome P450, 4A11, 183 

CYP4B1 — Cytochrome P450, 4B1, 183 

CYP51 — Cytochrome P450, 51, 147, 183, 184 

CYP7A1 — Cytochrome P450, 7A1, 183 

D6S51E - HLA-associated transcript 2 
(BAT2), 342 

DAF - Decay accelerating factor, 62, 153, 343 

DAZ - Deleted in azoospermia, 79 

DAZL1 - Deleted in azoospermia-like, 
Y-linked, 80 

DBI - Acyl-CoA binding protein, 273 

DBY — DEAD/H box 3, Y-linked, 79 

DCC — Deleted in colorectal carcinoma, 37, 403 

DCP1 - Angiotensin I-converting enzyme, 88, 
342, 352 

DDX1 - DEAD/H box 1, 144 

DDX10 - DEAD/H box 10, 273 

DDX11 - DEAD/H box 11, 396 

DDX12 - DEAD/H box 12, 396 

DEFA1 - Defensin A1, 129, 399 

DEFA3 - Defensin A3, 399 

DEFA4 - Defensin A4, 129, 399 

DEFA5 - Defensin A5, 129, 399, 400 

DEFA6 - Defensin A6, 129, 399, 400 

DEFB1 - Defensin B1, 129, 399 

DEFB2 - Defensin B2, 129 

DES - Desmin, 121 

DFFRY - Fat facets-related, homolog of 
Drosophila gene, 79 

DHCR7 - Dehydrocholesterol, 7-, reductase, 
399 

DHFR - Dihydrofolate reductase, 58, 270 

DLST — Dihydrolipoamide succinyltransferase, 
273 

DMD - Dystrophin, 7, 8, 9, 13, 17, 66, 108, 110, 
222, 335, 338, 345, 362 

DMPK - Dystrophia myotonica protein kinase, 
145, 363 

DNECL - Dynein, cytoplasmic, 145 

DRD1 — Dopamine receptor 1, 58, 184 

DRD2 — Dopamine receptor 2, 184 

DRD3 - Dopamine receptor 3, 184 

DRD4 — Dopamine receptor 4, 184 

DRDS - Dopamine receptor 5, 58, 184, 266 

DRPLA - Dentatorubral-pallidoluysian 
atrophy, 362 

DUSP8 — Dual specificity phosphatase 8, 273 

DVL1 - Dishevelled, homolog of Drosophila 
gene, 144 

E2F1 — Transcription factor E2F1, 23 

EBVM1 - Epstein-Barr virus modification site 
2, 59 

EBVS1 - Epstein-Barr virus insertion site 1, 59 

EDNRA - Endothelin receptor A, 184 

EDNRB - Endothelin receptor B, 184 

EEF1G - Translation elongation factor G, 148 

EGF — Epidermal growth factor, 16, 58 

EGFR - Epidermal growth factor receptor, 16, 
56, 144 

EIF1AY - Translation initiation factor 1A, 
Y-linked, 79 

EIF2S3 — Eukaryotic translation initiation 
factor 2, 79 

ELA1 - Elastase, 59, 191, 197, 281 

ELK1 - Elk-1 proto-oncogene, 272 

EMD - Emerin, 390, 391 

EN1 - Engrailed 1, 249 

EN2 - Engrailed 2, 249 

ENO1 - Enolase 1, 59, 271 

ENO2 — Enolase 2, 15, 59 

EPB41 — Erythroid protein 4.1, 112 

EPRS — Aminoacyl-tRNA synthetase 
(glutaminyl/prolyl), 178, 393, 398, 399 

ERBB2 - Erb B proto-oncogene homolog 2, 56 

ERBB3 - Erb B proto-oncogene homolog 3, 56, 
60 

ERBB4 — Erb B proto-oncogene homolog 4, 56 

ESA4 — Esterase A4, 59 

ESR1 - Estrogen receptor, 189 

ESRRA - Estrogen-related receptor α, 273 

ETS1 - Ets 1 proto-oncogene, 144, 251 

ETS2 — Ets 2 proto-oncogene, 251 

EVI2A - Ecotropic viral integration site 2A, 9, 
110 

EVI2B - Ecotropic viral integration site 2B, 9, 
110 

EWSR1 - Ewing sarcoma breakpoint region 1, 
108 

F10 — Factor X, 16, 22, 23, 126, 191, 427 

F11 — Factor XI, 58, 126, 191 

F12 — Factor XII, 58, 127, 191 

F13A - Factor XIII subunit a, 15 

F13B — Factor XIII subunit b, 15, 123 

F2 — Prothrombin, 59, 88, 127, 191 

F2R — Thrombin receptor, 184 

F5 — Factor V, 88, 148 

F7 - Factor VII, 12, 16, 22, 23, 87, 126, 191, 236, 
427 

F8A — Factor VIII gene-associated transcript, 9, 
37, 391 

F8C — Factor VIII, 8, 9, 27, 29, 37, 66, 116, 148, 
339, 345, 391 

F9 — Factor IX, 36, 110, 126, 191, 239, 245, 308, 
309, 427 

FABP3 - Fatty acid binding protein 3, 272 

FAU - Fbr-associated ubiquitously expressed 
gene, 270 

FCER1A - Immunoglobulin E receptor, Fc 
fragment, IA, 60, 140 

FCER1B - Immunoglobulin E receptor, Fc 
fragment, IB, 60, 110, 140 

FCER1G - Immunoglobulin E receptor, Fc 
fragment, IG, 60, 239, 240 

FCER2 - Immunoglobulin E receptor, Fc 
fragment, II, 60 

FCGR1A - Immunoglobulin G receptor, Fc 
fragment, IA, 65 

FCGR1B - Immunoglobulin G receptor, Fc 
fragment, IB, 65 

FCGR1C - Immunoglobulin G receptor, Fc 
fragment, IC, 65, 140 

FCGR2A - Immunoglobulin G receptor, Fc 
fragment, 2A, 60, 401 

FCGR2B - Immunoglobulin G receptor, Fc 
fragment, 2B, 60, 401 

FCGR2C - Immunoglobulin G receptor, Fc 
fragment, 2C, 60, 400, 401 

FCGR3A - Immunoglobulin G receptor, Fc 
fragment, 3A, 60 

FCGR3B - Immunoglobulin G receptor, Fc 
fragment, 3B, 60 

FDPSL1 - Farnesyl diphosphate synthase-like 
1, 60 

FES - Fes proto-oncogene, 59 

FGA - Fibrinogen α-chain, 15, 109, 247, 348 

FGB - Fibrinogen β-chain, 15, 109, 247, 348, 
391 

FGF1 - Fibroblast growth factor 1, 16, 58, 158, 
159 

FGF10 — Fibroblast growth factor 10, 158, 159 

FGF11 — Fibroblast growth factor 11, 158, 159 

FGF12 - Fibroblast growth factor 12, 158 

FGF13 — Fibroblast growth factor 13, 158 

FGF14 — Fibroblast growth factor 14, 158 

FGF2 — Fibroblast growth factor 2, 21, 58, 158, 
159 

FGF3 — Fibroblast growth factor 3, 59, 158, 390 

FGF4 — Fibroblast growth factor 4, 59, 158, 
159, 390 

FGF5 - Fibroblast growth factor 5, 58, 158 

FGF6 — Fibroblast growth factor 6, 59, 158, 159 

FGF7 — Fibroblast growth factor 7, 158, 266 

FGF8 — Fibroblast growth factor 8, 158 

FGF9 — Fibroblast growth factor 9, 158 

FGFR1 - Fibroblast growth factor receptor 1, 
24, 159 

FGFR2 — Fibroblast growth factor receptor 2, 
24, 159 

FGFR3 - Fibroblast growth factor receptor 3, 
24, 58, 159 

FGFR4 - Fibroblast growth factor receptor 4, 
16, 58, 159 

FGG - Fibrinogen γ-chain, 15, 109, 247, 348, 
391 

FGR — Fgr proto-oncogene homolog, 56, 59, 62 

FKHL13 - Forkhead FKHL13, 121 

FLG - Filaggrin, 16, 26, 357, 372 

FLII — Flightless, homolog of Drosophila gene, 
144 

FLN1 - Filamin, 390, 391 

FMO2 - Flavin-containing monooxygenase 2, 
283, 284 

FMR1 - Fragile X mental retardation 
syndrome type 1, 38, 360, 361, 362, 365, 
366 

FMR2 - Fragile X mental retardation 
syndrome type 2, 362 

FN1 - Fibronectin, 113 

FOS - Fos proto-oncogene, 144 

FPR1 - Formylpeptide receptor 1, 108, 347 

FPRL1 - Formylpeptide receptor-like 1, 347 

FPRL2 - Formylpeptide receptor-like 2, 347 

FRDA - Frataxin, 362, 363 

FRV1 - Full length retroviral sequence 1, 60 

FRV3 — Full length retroviral sequence 3, 60 

FSHB - Follicle stimulating hormone β, 60 

FTH1 - Ferritin H chain, 21, 271 

FTHL1 - Ferritin, heavy polypeptide-like 1, 62 

FTHL2 - Ferritin, heavy polypeptide-like 2, 62 

FTL - Ferritin L chain, 271 

FTNA - Fertilin, α-, 282 

FUT1 - Fucosyltransferase 1, 61, 141, 348 

FUT2 - Fucosyltransferase 2, 12, 61, 141, 348 

FUT3 - Fucosyltransferase 3, 12, 142, 348 

FUT4 — Fucosyltransferase 4, 61, 142, 348 

FUT5 - Fucosyltransferase 5, 142, 348 

FUT6 - Fucosyltransferase 6, 142, 348 

FUT7 - Fucosyltransferase 7, 142, 348 

FUT8 - Fucosyltransferase 8, 142, 348 

FXY — Finger on X and Y, 81 

FYN - Fyn proto-oncogene, 56 

GAA - Acid maltase, 111 

GABRA1 - γ-Aminobutyric acid receptor α1,
58, 159, 392

GABRA2 - γ-Aminobutyric acid receptor α2,
58, 159, 392

GABRA3 - γ-Aminobutyric acid receptor α3,
159, 391

GABRA4 - γ-Aminobutyric acid receptor α4,
159

GABRA5 - γ-Aminobutyric acid receptor α5,
159, 352, 391, 392, 396

GABRA6 - γ-Aminobutyric acid receptor α6,
159

GABRB1 - γ-Aminobutyric acid receptor β1,
58, 159, 392

GABRB2 — γ-Aminobutyric acid receptor β2,
159, 392

GABRB3 - γ-Aminobutyric acid receptor β3,
159, 392

GABRB4 — γ-Aminobutyric acid receptor β4,
159

GABRD - γ-Aminobutyric acid receptor δ, 159

GABRE - γ-Aminobutyric acid receptor ε, 159

GABRG1 - γ-Aminobutyric acid receptor γ1,
159, 392

GABRG2 - γ-Aminobutyric acid receptor γ2,
159, 392

GABRG3 - γ-Aminobutyric acid receptor γ3,
159, 392

GABRR1 - γ-Aminobutyric acid receptor ρ1,
159

GABRR2 - γ-Aminobutyric acid receptor ρ2,
159


GALE — UDP-galactose-4-epimerase, 15 

GALK1 - Galactokinase 1, 15 

GALK2 - Galactokinase 2, 15 

GALNS - Galactose 6-sulfatase, 176 

GALT - Galactose-1-phosphate
uridylyltransferase, 15, 26, 402 

GANAB - Glucosidase, α, neutral AB, 61

GANC — Glucosidase, α, neutral C, 61

GAPD - Glyceraldehyde-3-phosphate
dehydrogenase, 15, 118, 271 

GARS — Aminoacyl-tRNA synthetase (glycyl), 
178 

GART - Glycinamide ribonucleotide 
synthetase, 25 

GATA1 - GATA-binding protein 1, 66, 78

GBA - Glucocerebrosidase, 267, 277 

GC - Group-specific component/vitamin D- 
binding globulin, 153, 247, 338 

GCG - Glucagon, 124 

GCP — Green cone pigment, 14, 20, 110, 316, 
330, 351, 401, 402, 408 

GDF1 — Growth/differentiation factor 1, 25

GDH - Glucose dehydrogenase, 15, 59 

GFAP - Glial fibrillary acidic protein, 121 

GFER - Growth factor, yeast 
Erv1-homologous, 399

GGTA1 - Galactosyltransferase, α-1,3-, 143,
281, 332 

GH1 - Growth hormone 1, pituitary, 10, 16, 20,
110, 141, 162, 164, 165, 222, 227, 254, 268, 
278, 302, 321, 322, 404, 410 

GH2 - Growth hormone 2, placental, 16, 141, 
162, 164, 165, 227, 268, 278, 321, 322, 404, 
410 

GHDTA - Growth hormone gene-derived 
transcriptional activator, 10 

GHR — Growth hormone receptor, 16, 346 

GIP — Glucose-dependent insulinotropic 
peptide, 118 

GLA — Galactosidase A, α-D-, 116, 247, 350

GLI — Glioma-associated proto-oncogene, 60, 
390 

GMCSF - Granulocyte-macrophage colony- 
stimulating factor, 16 

GNA11 - Guanine nucleotide binding protein,
alpha 11, 160 

GNA12 - Guanine nucleotide binding protein, 
alpha 12, 160 

GNA13 — Guanine nucleotide binding protein, 
alpha 13, 160 

GNA14 — Guanine nucleotide binding protein, 
alpha 14, 160 

GNA15 - Guanine nucleotide binding protein,
alpha 15, 160 

GNAI1 - Guanine nucleotide binding protein,
alpha inhibiting activity polypeptide 1, 
160 

GNAI2 - Guanine nucleotide binding protein,
alpha inhibiting activity 2, 160 

GNAI2L — Guanine nucleotide binding 
protein, alpha inhibiting activity 
polypeptide 2-like, 59 

GNAI3 - Guanine nucleotide binding protein,
alpha inhibiting activity 3, 59, 160 

GNAL - Guanine nucleotide binding protein, 
alpha activating activity polypeptide, 
olfactory type, 160 

GNAO1 - Guanine nucleotide binding protein,
alpha activating activity polypeptide O,
160

GNAQ - GTP-binding protein Gαq, 160, 272

GNAS1 - Guanine nucleotide binding protein, 
alpha stimulating activity polypeptide 1, 
160 

GNAT1 - Guanine nucleotide binding protein,
alpha transducing activity 1, 160

GNAT2 - Guanine nucleotide binding protein,
alpha transducing activity 2, 59, 160 

GNAZ - Guanine nucleotide binding protein, 
alpha Z, 160 

GNB1 — Guanine nucleotide binding protein,
beta polypeptide 1, 59 

GNB3 - Guanine nucleotide binding protein, 
beta polypeptide 3, 59 

GNS - Glucosamine 6-sulfatase, 176 

GOT2L1 — Glutamic-oxaloacetic transaminase 
2-like 1, 59, 62 

GOT2L2 — Glutamic-oxaloacetic transaminase 
2-like 2, 59, 62 

GOT2L3 — Glutamic-oxaloacetic transaminase 
2-like 3, 59

GPD2 - Mitochondrial-3-glycerol phosphate
dehydrogenase 2, 272 

GPI — Glucose phosphate isomerase, 59, 148 

GPR1 - G-protein-coupled receptor 1, 185, 228

GPR10 — G-protein-coupled receptor 10, 185

GPR12 - G-protein-coupled receptor 12, 185

GPR13 — G-protein-coupled receptor 13, 185

GPR15 - G-protein-coupled receptor 15, 185

GPR18 — G-protein-coupled receptor 18, 185

GPR2 - G-protein-coupled receptor 2, 185 

GPR20 - G-protein-coupled receptor 20, 185 

GPR3 - G-protein-coupled receptor 3, 185 

GPR4 - G-protein-coupled receptor 4, 185 

GPR5 - G-protein-coupled receptor 5, 185

GPR6 - G-protein-coupled receptor 6, 185 

GPR7 - G-protein-coupled receptor 7, 185 

GPR8 - G-protein-coupled receptor 8, 185

GPR9 — G-protein-coupled receptor 9, 185

GPRK6 - G protein-coupled receptor kinase
GRK6, 273 

GPX5 — Glutathione peroxidase type 5, 115 

GRL - Glucocorticoid receptor, 58, 189 

GRM5 - Glutamate receptor subtype 5,
metabotropic, 113 

GSN — Gelsolin, 109 

GSTA1 — Glutathione S-transferase A1, 363

GSTA2 — Glutathione S-transferase A2, 390

GSTM1 - Glutathione S-transferase, mu-class 
1, 60, 330 

GSTM2 — Glutathione S-transferase, mu-class 
2, 330 

GSTM4 — Glutathione S-transferase, mu-class 
4, 330 

GSTM5 — Glutathione S-transferase, mu-class
5, 330 

GSTP1 — Glutathione S-transferase, Pi class, 60 

GSTT1 — Glutathione S-transferase, theta-class
1, 330 

GTBP - G/T mismatch-binding protein, 147 

GUK1 — Guanylate kinase 1, 62

GULOP - Gulono-γ-lactone, L-, oxidase, 282,
320

GUSB — Glucuronidase, β-, 267

GYPA - Glycophorin A, 164, 165, 166, 321, 
350, 402, 404, 405, 408 

GYPB - Glycophorin B, 130, 164, 165, 166, 
321, 335, 350, 404, 405 

GYPE - Glycophorin E, 130, 164, 165, 335, 
350, 404, 405, 408 

GZMA - Granzyme A, 58 

H19 — Adult skeletal muscle gene, imprinted 
(D11S878E), 9, 30 

H1F0 — Histone H1°, 180, 228 

H1F1 — Histone 1, family 1, 180, 228

H1F2 - Histone 1 family 2, 14, 60, 180

H1F3 - Histone 1 family 3, 14, 180

H1F4 — Histone 1 family 4, 14, 60, 180

H1F5 — Histone 1, family 5, 180

H1FT — Histone H1, testis-specific, 180

H2A - Histone 2 family A, 14, 180, 181, 231 

H2AX — Histone H2A.X, 180 

H2AZ - Histone H2A.Z, 180 

H2B - Histone 2 family B, 14, 180, 181, 231 

H3F2 — Histone 3 family 2, 14 

H3F3A — Histone H3.3A, 180 

H3F3B — Histone H3.3B, 180 

H4F2 — Histone 4 family 2, 14 

HARS - Aminoacyl-tRNA synthetase 
(histidyl), 118, 119, 178, 393

HBA1 - Globin, α1-, 15, 20, 161, 234, 247, 267,
268, 275, 351, 404, 408

HBA2 - Globin, α2-, 15, 161, 268, 404, 408

HBB - Globin, β-, 7, 13, 15, 16, 20, 29, 67, 85,
86, 111, 162, 223, 244, 268, 408, 411

HBD - Globin, δ-, 7, 16, 162, 247, 268, 408

HBE1 - Globin, ε-, 16, 162, 222, 223, 238, 241,
268

HBG1 - Globin, γ2-, 7, 16, 162, 222, 223, 230,
235, 268, 349, 404, 408, 411

HBG2 - Globin, γ2-, 16, 162, 222, 230, 235, 268,
349, 408, 411

HBQ1 - Globin, θ-, 161, 240, 268

HBZ — Globin, ζ-, 161, 234, 268, 275, 279, 331,
351, 404 

HCF2 — Heparin cofactor II, 192 

HD — Huntingtin, 362, 364, 365 

HDAC1 - Histone deacetylase, 145

HES1 — Rho cross-reacting protein 27A, E. coli
homolog, 147 

HEXB - Hexosaminidase B, beta polypeptide, 
58 

HF1 - H factor, 153

HFE — Hemochromatosis, 171 

HHLA1 - HERV-H-associating 1, 243

HK1 — Hexokinase 1, 146

HK2 — Hexokinase 2, 146, 273

HK3 — Hexokinase 3, 146

HKR1 - Gli-kruppel family member, HKR1,
60

HKR2 - Gli-kruppel family member, HKR2,
60 

HKR3 - Gli-kruppel family member, HKR3, 
60 

HLA-A - Major histocompatibility complex, 
class Ia, A, 171, 173, 227, 267, 278 


HLA-B - Major histocompatibility complex,
class Ia, B, 140, 171, 173, 227, 411

HLA-C — Major histocompatibility complex,
class Ia, C, 171, 173

HLA-DNA - Major histocompatibility 
complex, class II, DNA, 171 

HLA-DQA1 - Major histocompatibility
complex, class II, DQA1, 171 

HLA-DQA2 - Major histocompatibility 
complex, class II, DQA2, 171 

HLA-DQB1 - Major histocompatibility
complex, class II, DQB1, 277, 408 

HLA-DRA - Major histocompatibility 
complex, class II, DRA, 171, 411 

HLA-DRB1 - Major histocompatibility
complex, class II, DRB1, 86, 171, 344 

HLA-DRB2 - Major histocompatibility 
complex, class II, DRB2, 171 

HLA-DRB3 - Major histocompatibility 
complex, class II, DRB3, 171 

HLA-DRB4 - Major histocompatibility 
complex, class II, DRB4, 171 

HLA-DRB5 - Major histocompatibility
complex, class II, DRB5, 171

HLA-DRB6 - Major histocompatibility 
complex, class II, DRB6, 121, 239, 244, 
399 

HLA-E - Major histocompatibility complex, 
class Ib, E, 171, 173, 174, 267 

HLA-F - Major histocompatibility complex, 
class Ib, F, 171, 173, 267

HLA-G - Major histocompatibility complex, 
class Ib, G, 171, 173, 267 

HMG1 - High mobility group protein 1, 146,
250

HMG14 - High mobility group protein 14, 250

HMG17 - High mobility group protein 17, 250

HMG2 - High mobility group protein 2, 146, 
250 

HMG4 - High mobility group protein 4, 250 

HMGCS1 - HMG CoA synthase, cytoplasmic, 
349 

HMGCS2 - HMG CoA synthase, 
mitochondrial, 349 

HMR - Hormone receptor TR3, 60 

HMX2 — Homeobox, H6 family 2, 167 

HNF4A - Hepatocyte nuclear factor 4, 189 

HNRPA1 — Heterogeneous nuclear
ribonucleoprotein A1, 176

HOXA1 — Homeobox A1, 166 

HOXA10 — Homeobox A10, 166 

HOXA11 — Homeobox A11, 166

HOXA13 — Homeobox A13, 166 

HOXA2 — Homeobox A2, 166 

HOXA3 — Homeobox A3, 166 

HOXA4 — Homeobox A4, 166 

HOXA5 - Homeobox A5, 166

HOXA6 — Homeobox A6, 166 

HOXA7 — Homeobox A7, 166 

HOXA9 — Homeobox A9, 166 

HOXB1 — Homeobox B1, 166

HOXB13 — Homeobox B13, 166

HOXB2 — Homeobox B2, 166 


HOXB3 — Homeobox B3, 166 

HOXB4 — Homeobox B4, 166 

HOXB5 — Homeobox B5, 166

HOXB6 — Homeobox B6, 166 

HOXB7 — Homeobox B7, 166 

HOXB8 — Homeobox B8, 166 

HOXB9 — Homeobox B9, 166 

HOXC10 — Homeobox C10, 167 

HOXC11 - Homeobox C11, 167 

HOXC12 — Homeobox C12, 167 

HOXC13 - Homeobox C13, 167 

HOXC4 — Homeobox C4, 166 

HOXC5 — Homeobox C5, 166

HOXC6 — Homeobox C6, 167 

HOXC8 - Homeobox C8, 167, 249 

HOXC9 — Homeobox C9, 167 

HOXD1 — Homeobox D1, 167

HOXD10 — Homeobox D10, 167

HOXD11 — Homeobox D11, 167

HOXD12 — Homeobox D12, 167 

HOXD13 - Homeobox D13, 167, 362 

HOXD3 - Homeobox D3, 167 

HOXD4 — Homeobox D4, 167 

HOXD8 - Homeobox D8, 167 

HOXD9 — Homeobox D9, 167 

HP — Haptoglobin, 141, 282, 329, 337, 351, 404, 
408 

HPR — Haptoglobin-related protein, 141, 282, 
404, 408 

HPRT1 - Hypoxanthine
phosphoribosyltransferase, 29, 331

HPS - Hermansky-Pudlak syndrome, 23 

HPX — Hemopexin, 123 

HRAS - Harvey Ras proto-oncogene, 56, 58, 
245 

HRG - Histidine-rich glycoprotein, 184 

HSD3B2 — Hydroxysteroid dehydrogenase, 3β-,
type I, 110

HSPA1A — Heat shock protein, 70 kDa, A1A,
18, 108, 185, 346

HSPA1B — Heat shock protein, 70 kDa, A1B,
185, 346

HSPA1L — Heat shock protein, 70 kDa, A1L,
185, 346 

HSPA2 — Heat shock protein, 70 kDa, A2, 108, 
185 

HSPA3 - Heat shock protein, 70 kDa, A3, 185 

HSPA4 — Heat shock protein, 70 kDa, A4, 185 

HSPA5 — Heat shock protein, 70 kDa, A5, 185

HSPA6 - Heat shock protein, 70 kDa, A6, 185 

HSPA7 — Heat shock protein, 70 kDa, A7, 185 

HSPA8 — Heat shock protein, 70 kDa, A8, 185 

HSPA9 — Heat shock protein, 70 kDa, A9, 185 

HSPB2 — Heat shock protein 27, 185 

HSPCA — Heat shock protein 90, A, 185 

HSPCB - Heat shock protein 90, B, 110, 185 

HSPG2 — Perlecan, 124 

HTR1A - Serotonin receptor 1A, 58, 346

HTR1D - Serotonin receptor 1D, 108, 272

HTR1F - Serotonin receptor 1F, 108

HXB — Tenascin C, 57 

IAPP - Islet amyloid polypeptide, 59, 270

IARS — Aminoacyl-tRNA synthetase
(isoleucyl), 178 

ID2 — Inhibitor of DNA binding, 108 

IDH3G - Isocitrate dehydrogenase, 233 

IDO - Indoleamine 2,3 dioxygenase, 109, 197 

IDS — Iduronate-2-sulfatase, 176, 268, 346

IF - I factor, 58, 153

IFNA1 — Interferon α1, 186, 187, 408

IFNA10 — Interferon α10, 186, 187, 188, 333

IFNA13 — Interferon α13, 186, 187, 408

IFNA14 — Interferon α14, 186, 187

IFNA16 — Interferon α16, 186, 187

IFNA17 — Interferon α17, 186, 187

IFNA2 - Interferon α2, 186, 187, 188, 333

IFNA21 - Interferon α21, 186, 187

IFNA4 - Interferon α4, 186, 187

IFNA5 - Interferon α5, 186, 187

IFNA6 - Interferon α6, 186, 187

IFNA7 - Interferon α7, 186, 187

IFNA8 - Interferon α8, 186, 187

IFNAR1 - Interferon receptor α, β, ω, 1, 16,
199, 346, 348

IFNAR2 - Interferon receptor α, β, ω, 2, 199

IFNB1 — Interferon β, 16, 108, 186, 187, 225

IFNG - Interferon γ, 16, 186, 189

IFNGR1 - Interferon receptor γ1, 16

IFNGR2 - Interferon receptor γ2, 199

IFNW1 - Interferon ω1, 16, 186, 187

IGF1 — Insulin-like growth factor 1, 56, 60, 143,
186, 198

IGF1R - Insulin-like growth factor 1 receptor,
60, 198 

IGF2 - Insulin-like growth factor 2, 30, 56, 60, 
130, 143, 186, 245, 393 

IGF2R - Insulin-like growth factor 2 receptor,
30, 393 

IGHA1 - Immunoglobulin heavy chain,
constant α1, 194, 408, 410

IGHA2 - Immunoglobulin heavy chain,
constant α2, 194, 408

IGHD - Immunoglobulin heavy chain,
constant δ, 194, 245, 346

IGHDY - Immunoglobulin heavy chain, D
segments, 194

IGHDY2 - Immunoglobulin D segments
(orphon), 195

IGHE - Immunoglobulin heavy chain,
constant ε1, 194, 267

IGHG1 — Immunoglobulin heavy chain,
constant γ1, 194, 410

IGHG2 — Immunoglobulin heavy chain,
constant γ2, 194

IGHG3 - Immunoglobulin heavy chain,
constant γ3, 194

IGHG4 - Immunoglobulin heavy chain,
constant γ4, 194, 331, 352

IGHJ — Immunoglobulin heavy chain, J
segments, 194, 245

IGHM - Immunoglobulin heavy chain,
constant μ, 194

IGHV - Immunoglobulin heavy chain, variable
segments, 194, 195, 267, 272, 274, 278, 
352, 408 


IGHV2 - Immunoglobulin VH segments
(orphon), 195

IGHV3 - Immunoglobulin VH segments
(orphon), 195

IGKC - Immunoglobulin light chain, kappa,
constant, 195, 346

IGKJ — Immunoglobulin light chain, kappa, J
segments, 195 

IGKV - Immunoglobulin light chain, kappa, 
variable segments, 194, 267, 391, 396, 401 

IGLC — Immunoglobulin light chain, lambda, 
constant, 195, 271, 346, 410 

IGLJ — Immunoglobulin light chain, lambda, J
segments, 195

IGLL1 — Immunoglobulin λ-like polypeptide 1,
277 

IGLV — Immunoglobulin light chain, lambda, 
variable segments, 195 

IHH - Indian hedgehog, 251

IL11RA — Interleukin 11 receptor α, 26, 402

IL12B - Interleukin 12B, 16

IL13 — Interleukin 13, 16

IL1A — Interleukin-1α, 199, 321

IL1B — Interleukin-1β, 199, 321

IL1RN - Interleukin-1 receptor antagonist,
199, 321 

IL2 — Interleukin 2, 58 

IL3 — Interleukin 3, 16, 58 

IL3RA - Interleukin 3 receptor α, 16, 80, 81

IL4 — Interleukin 4, 16, 58 

IL4R — Interleukin 4 receptor, 198 

IL5 — Interleukin 5, 16, 58

IL6 — Interleukin 6, 236 

IL6ST — Interleukin-6 signal transducer, 272 

IL8 - Interleukin 8, 198 

IL8RA — Interleukin-8 receptor A, 198, 199,
267 

IL8RB — Interleukin-8 receptor B, 198, 199, 267 

IL9 — Interleukin 9, 16, 58, 396 

IL9R — Interleukin 9 receptor, 396 

INS - Insulin, 16, 30, 56, 60, 186, 198, 245, 393 

INSL2 - Insulin-like 2, 60

INSR - Insulin receptor, 16, 60, 186, 198

INSRR - Insulin receptor-related, 60

IPF1 — Insulin promoter factor 1, 335

IPW - Imprinted in Prader-Willi syndrome, 9

ITGA1 — Integrin, α1, 168, 169

ITGA2 - Integrin, α2, 168, 169

ITGA2B - Integrin, α2b, 15, 169, 247, 348

ITGA3 - Integrin, α3, 15, 168, 169, 247, 348

ITGA4 — Integrin, α4, 168, 169

ITGA5 - Integrin, α5, 168, 169

ITGA6 - Integrin, α6, 168, 169

ITGA7 - Integrin, α7, 168, 169

ITGA8 - Integrin, α8, 168, 169

ITGA9 — Integrin, α9, 168, 169

ITGAD - Integrin, αD, 168, 169

ITGAE - Integrin, αE, 168, 169

ITGAL - Integrin, αL, 168, 169

ITGAM - Integrin, αM, 168, 169

ITGAV - Integrin, αV, 169

ITGAX - Integrin, αX, 169

ITGB1 — Integrin, β1, 169

ITGB2 - Integrin, β2, 169

ITGB3 - Integrin, β3, 169

ITGB4 - Integrin, β4, 169

ITGB5 - Integrin, β5, 169

ITGB6 — Integrin, β6, 169

ITGB7 - Integrin, β7, 169

ITGB8 - Integrin, β8, 169

ITIH1 - Inter-α-trypsin inhibitor heavy chain
1, 391

ITIH2 - Inter-α-trypsin inhibitor heavy chain
2, 391

ITIH3 - Inter-α-trypsin inhibitor heavy chain
3, 391

ITIH4 - Inter-α-trypsin inhibitor heavy chain
4, 391 

IVD - Isovaleryl coenzyme A dehydrogenase, 
60 

IVL — Involucrin, 16, 365, 367, 373 

JAK1 - Janus tyrosine kinase 1, 56

JAK2 - Janus tyrosine kinase 2, 56 

JAK3 - Janus tyrosine kinase 3, 56 

JUN — Jun proto-oncogene, 59, 107, 144 

JUNB - Jun B proto-oncogene, 59 

JUND - Jun D proto-oncogene, 59

KAL1 - Kallmann syndrome 1, 78, 267

KARS — Aminoacyl-tRNA synthetase (lysyl), 
178 

KCNA1 — Potassium voltage-gated channel,
shaker-related subfamily, member 1, 60 

KCNA2 — Potassium voltage-gated channel, 
shaker-related subfamily, member 2, 60 

KCNA5 - Potassium voltage-gated channel,
shaker-related subfamily, member 5, 60 

KCNA7 — Potassium voltage-gated channel, 
shaker-related subfamily, member 7, 60 

KCNC2 — Potassium voltage-gated channel, 
Shaw-related subfamily, member 2, 60 

KCNC3 — Potassium voltage-gated channel, 
Shaw-related subfamily, member 3, 60 

KCNC4 — Potassium voltage-gated channel, 
Shaw-related subfamily, member 4, 60 

KHK - Ketohexokinase, 355 

KIT — Kit proto-oncogene, 58 

KLK1 - Kallikrein, renal/pancreatic/salivary,
59 

KLK2 — Kallikrein, glandular, 59 

KLK3 — Kallikrein, plasma, 58, 191 

KLRC1 - Lectin-like type II integral
membrane protein, 342 

KNG - Kininogen, 184, 352 

KRAS2 — Ki-Ras 1 proto-oncogene, 56, 59, 272 

KRT1 - Keratin 1, 169, 170

KRT10 — Keratin 10, 169, 170

KRT12 — Keratin 12, 170

KRT13 — Keratin 13, 170

KRT14 — Keratin 14, 169, 170, 266

KRT15 — Keratin 15, 169, 170

KRT16 — Keratin 16, 169, 170

KRT17 — Keratin 17, 169, 170

KRT18 — Keratin 18, 169, 170, 238, 239 

KRT19 — Keratin 19, 169, 170 

KRT2A — Keratin 2A, 169, 170 

KRT3 — Keratin 3, 170 


KRT4 — Keratin 4, 170 

KRT5 — Keratin 5, 169, 170

KRT6A — Keratin 6A, 169, 170

KRT6B — Keratin 6B, 169, 170

KRT7 — Keratin 7, 170

KRT8 — Keratin 8, 169, 170

KRT9 — Keratin 9, 169, 170 

KRTHA1 - Keratin, hair, acidic 1, 170

KRTHA2 - Keratin, hair, acidic 2, 170

KRTHA3A - Keratin, hair, acidic 3A, 170

KRTHA3B - Keratin, hair, acidic 3B, 170

KRTHA4 - Keratin, hair, acidic 4, 170

KRTHA5 - Keratin, hair, acidic 5, 170

KRTHB1 - Keratin, hair, basic 1, 170

KRTHB2 - Keratin, hair, basic 2, 170

KRTHB3 - Keratin, hair, basic 3, 170

KRTHB4 - Keratin, hair, basic 4, 170

KRTHB5 - Keratin, hair, basic 5, 170

KRTHB6 - Keratin, hair, basic 6, 170 

LALBA - Lactalbumin, α-, 142, 349

LAMP2 - Lysosomal-associated membrane 
protein 2, 66 

LAMR1 - Laminin receptor, 68 kDa, 9

LARS - Aminoacyl-tRNA synthetase (leucyl), 
178 

LBR — Lamin B receptor, 399 

LCAT - Lecithin:cholesterol acyltransferase, 
113, 343 

LCK - Lymphocyte-specific protein, tyrosine 
kinase, 59, 62 

LDHA - Lactate dehydrogenase A, 15, 59, 349 

LDHAL2 - Lactate dehydrogenase A-like 2, 59 

LDHB - Lactate dehydrogenase B, 15, 59, 349 

LDHC - Lactate dehydrogenase C, 59 

LDLR - Low density lipoprotein receptor, 16, 
29, 60, 123 

LEF1 — Lymphoid enhancer-binding factor 1,
250 

LHB - Luteinizing hormone β, 60, 227, 336

LIFR — Leukemia inhibitory factor receptor, 
113, 339 

LIPC - Lipase, hepatic, 59, 122, 144 

LIPE — Lipase, hormone-sensitive, 59, 236 

LLGL1 — Tumor suppressor gene lethal 2 giant
larvae, homolog of Drosophila gene, 144 

LMNB2 - Lamin B, 108 

LMO1 — Rhombotin 1, 59

LMO2 — Rhombotin-like 1, 59 

LMO3 — Rhombotin-like 2, 59 

LMP2 - Proteasome subunit, beta type, 9, 231 

LOR - Loricrin, 16, 372, 373 

LPA — Apolipoprotein(a), 141, 236, 239, 245, 
318, 357 

LPL — Lipoprotein lipase, 13, 122, 144, 236 

LRP1 — Low density lipoprotein-related
protein, 60 

LRP8 — Apolipoprotein E receptor 2, 130 

LTF — Lactoferrin, 229 

LTK - Leukocyte tyrosine kinase, 60 

LTNA - Lymphotactin α, 274

LTNB — Lymphotactin β, 274

LYL1 — Lymphoblastic leukemia-derived
sequence 1, 59 


LYZ — Lysozyme, 301, 342, 349 

M17S2 — Membrane component, surface
marker 2, 238

MAG - Myelin-associated glycoprotein, 59 

MAGP2 - Microfibril-associated glycoprotein
2, 109

MANA1 - Mannosidase, alpha A, cytoplasmic,
61

MANB - Mannosidase, alpha B, lysosomal, 61

MAOA - Monoamine oxidase A, 78, 109, 236 

MAOB - Monoamine oxidase B, 109 

MAP1A - Microtubule-associated protein 1A,
25

MAP1B - Microtubule-associated protein 1B,
25 

MARS - Aminoacyl-tRNA synthetase 
(methionyl), 178 

MATN1 - Cartilage matrix protein, 23

MAX — Max protein, 199 

MB - Myoglobin, 160 

MBP — Mannose-binding protein, 267, 404 

MC2R - Adrenocorticotropic hormone 
receptor, 184 

MCM2 - Minichromosome maintenance, yeast 
homolog 2, 145

MCM3 - Minichromosome maintenance, yeast 
homolog 3, 145 

MCM4 - Minichromosome maintenance, yeast 
homolog 4, 145, 233 

MCM5 - Minichromosome maintenance, yeast
homolog 5, 145 

MCM6 - Minichromosome maintenance, yeast 
homolog 6, 145 

MCM7 - Minichromosome maintenance, yeast 
homolog 7, 145 

MCP — Membrane cofactor protein, 153 

ME1 - NADP-dependent malate
dehydrogenase, 390 

MEF2A — MADS box enhancing factor 2A, 56, 
275 

MEF2B - MADS box enhancing factor 2B, 56 

MEF2C - MADS box enhancing factor 2C, 56 

MEF2D - MADS box enhancing factor 2D, 56 

MFAP2 - Microfibril-associated protein 2, 109 

MGB1 — Mammaglobin 1, 228

MGB2 — Mammaglobin 2, 228 

MGMT - Methylguanine, O6-, DNA
methyltransferase, 110

MGST1 — Microsomal glutathione S-
transferase 1, 60 

MIC2 — MIC2 antigen, 78, 80 

MICA — Major histocompatibility complex
class I-related A, 171

MICB - Major histocompatibility complex
class I-related B, 171

MICC - Major histocompatibility complex
class I-related C, 171

MICD - Major histocompatibility complex
class I-related D, 171

MICE - Major histocompatibility complex
class I-related E, 171

MJD — Ataxin 3, 362, 364

MLH1 — Mismatch repair protein, E. coli mutL
homolog 1, 147 

MLL - Myeloid/lymphoid leukemia, Drosophila 
trithorax homolog, 403 

MLR - Mineralocorticoid receptor, 58, 189 

MMP1 — Matrix metalloproteinase 1, 236, 348

MMP10 — Matrix metalloproteinase 10, 348 

MMP11 - Matrix metalloproteinase 11, 348 

MMP12 — Matrix metalloproteinase 12, 348 

MMP13 - Matrix metalloproteinase 13, 348 

MMP14 — Matrix metalloproteinase 14, 348 

MMP15 — Matrix metalloproteinase 15, 348

MMP16 - Matrix metalloproteinase 16, 348 

MMP19 — Matrix metalloproteinase 19, 348 

MMP2 - Matrix metalloproteinase 2, 348 

MMP3 - Matrix metalloproteinase 3, 348 

MMP7 - Matrix metalloproteinase 7, 348 

MMP8 - Matrix metalloproteinase 8, 348 

MMP9 - Matrix metalloproteinase 9, 348 

MNBH - Minibrain, homolog of Drosophila 
gene, 144 

MOCOD - Molybdenum cofactor synthesis 
protein, 25 

MOK2 - MOK2 transcription factor, 193, 253 

MPI — Mannose phosphate isomerase, 59

MPO - Myeloperoxidase, 122, 239, 240

MRC1 — Macrophage mannose receptor, 353

MSH2 - Mismatch repair protein, E. coli mutS
homolog 2, 147

MSMB - Microseminoprotein, β-, 300

MSX1 — Homeobox 7, 167

MT2A — Metallothionein 2A, 18

MTAP - Methylthioadenosine phosphorylase, 
273 

MTCO2 - Cytochrome c oxidase subunit II, 
355 

MTX - Metaxin, 267 

MUC1 - Mucin 1, transmembrane, 61, 175,
357, 372 

MUC2 — Mucin 2, intestinal/tracheal, 9, 17, 61, 
175, 176, 357 

MUC3 — Mucin 3, intestinal, 9, 175 

MUC4 — Mucin 4, tracheobronchial, 9, 175, 357 

MUC5AC — Mucin 5AC,
tracheobronchial/gastric, 9, 17, 61, 175,
176

MUC5B - Mucin 5, tracheobronchial, 9, 17, 61,
116, 175, 176

MUC6 - Mucin 6, gastric, 9, 17, 175, 176

MUC7 - Mucin 7, salivary, 175

MUC8 — Mucin 8, tracheobronchial, 175

MYB - Myb proto-oncogene, 146 

MYBPC3 - Cardiac myosin binding protein C, 
116 

MYC - Myc proto-oncogene, 143, 198 

MYCL1 - Myc proto-oncogene homolog 1, 59

MYF5 — Myogenic factor 5, 59, 251, 350

MYF6 — Myogenic factor 6, 59, 251, 350

MYH6 - Myosin, α-, heavy chain, 67, 404

MYH7 — Myosin, β-, heavy chain, 29, 67, 408

MYL5 - Myosin light chain, regulatory, 228

MYOD1 - Myogenic factor 3, 28, 59, 251, 350

MYOG - Myogenin, 59

NAGA - Acetylgalactosaminidase, α-N-, 350

NAIP — Neuronal apoptosis inhibitory protein, 
346 

NARS — Aminoacyl-tRNA synthetase 
(asparaginyl), 178 

NCA - Non-specific cross-reacting antigen, 59 

NCAM1 - Neural cell adhesion molecule, 59,
112

NCF1 — Neutrophil cytosolic factor p47-phox,
277

NCL — Nucleolin, 176

NDUFA5 - NADH:ubiquinone oxidoreductase
B13 subunit, 272

NDUFV2 - Mitochondrial NADH:ubiquinone
oxidoreductase, 272, 275

NEFL - Neurofilament protein, 68 kDa, 121

NF1 — Neurofibromin, 9, 29, 108, 110, 145, 247,
269, 272, 274, 396 

NFIC — Nuclear factor I/C, 113 

NFIX — Nuclear factor I/X, 113 

NGFB — Nerve growth factor, 60, 198 

NME1 — Awd, homolog of Drosophila gene, 144

NOS2A - Nitric oxide synthase, inducible, 246

NOTCH1 - Notch 1, homolog of Drosophila
gene, 346

NOTCH4 — Notch 4, homolog of Drosophila
gene, 346, 348 

NP — Purine nucleoside phosphorylase, 110 

NPM1 - Nucleophosmin, 273 

NRAS - Neuroblastoma Ras proto-oncogene, 
59, 62 

NTF3 — Neurotrophin 3, 198 

NTF5 — Neurotrophin 5, 198

NTRK1 — Neurotrophic tyrosine kinase
receptor type 1, 60, 198 

NTRK2 — Neurotrophic tyrosine kinase 
receptor type 2, 198 

NTRK3 — Neurotrophic tyrosine kinase 
receptor type 3, 144, 198 

OAT — Ornithine aminotransferase, 343 

OATL1 — Ornithine aminotransferase-like 1, 78

OC90 — Otoconin 90, 243 

OCM - Oncomodulin, 244 

OLR1 - Oxidized low density lipoprotein
receptor 1, 354 

OMG - Oligodendrocyte myelin glycoprotein, 
9, 109, 110 

OR10A1 — Olfactory receptor 10A1, 143, 193

OR1A1 - Olfactory receptor 1A1, 143, 193

OR1D2 - Olfactory receptor 1D2, 143, 193

OR1D4 - Olfactory receptor 1D4, 143, 193

OR1D5 — Olfactory receptor 1D5, 143, 193

OR1E1 — Olfactory receptor 1E1, 143, 193

OR1E2 - Olfactory receptor 1E2, 143, 193

OR1F1 — Olfactory receptor 1F1, 143, 193

OR1G1 - Olfactory receptor 1G1, 143, 193

OR2C1 - Olfactory receptor 2C1, 174

OR2D2 - Olfactory receptor 2D2, 143, 193

OR3A1 - Olfactory receptor 3A1, 143, 193

OR3A2 - Olfactory receptor 3A2, 143, 193

OR3A3 - Olfactory receptor 3A3, 143, 193

OR5D3 - Olfactory receptor 5D3, 143, 193

OR5D4 - Olfactory receptor 5D4, 143, 193

OR5F1 — Olfactory receptor 5F1, 143, 193

OR6A1 — Olfactory receptor 6A1, 143, 193

ORC1L — Origin recognition complex,
homolog of yeast gene, 145

ORM1 - Acid glycoprotein 1, α1-, 408

ORM2 - Acid glycoprotein 2, α1-, 408

OTC - Ornithine carbamoyltransferase, 15 

OXTR - Oxytocin receptor, 110 

P2RY1 — Purinoceptor 1, 108, 348 

P2RY2 — Purinoceptor 2, 348 

P2RY4 — Purinoceptor 4, 348 

P2RY6 — Purinoceptor 6, 348 

P2RY7 — Purinoceptor 7, 348 

P4HB — Prolyl-4 hydroxylase B polypeptide, 
399 

PABP2 - Oculopharyngeal muscular 
dystrophy, 362 

PABPL1 - Poly(A) binding protein, 176

PACE - Paired basic amino acid cleaving
enzyme (furin), 197

PACE4 — Paired basic amino acid cleaving
enzyme 4, 197

PAH - Phenylalanine hydroxylase, 57, 60, 109,
392, 393, 394

PAI1 — Plasminogen activator inhibitor 1, 114,
236 

PAI2 — Plasminogen activator inhibitor 2, 192 

PAICS — Phosphoribosylaminoimidazole 
carboxylase, 233 

PAX1 — Paired box homeotic gene 1, 144 

PAX2 — Paired box homeotic gene 2, 144 

PAX3 - Paired box homeotic gene 3, 144 

PAX4 — Paired box homeotic gene 4, 144 

PAX5 - Paired box homeotic gene 5, 144

PAX6 — Paired box homeotic gene 6, 28, 144, 
253 

PAX7 — Paired box homeotic gene 7, 144 

PAX8 — Paired box homeotic gene 8, 144, 253 

PAX9 — Paired box homeotic gene 9, 144 

PBX1 — Pre-B cell leukemia transcription
factor 1, 57 

PBX2 — Pre-B cell leukemia transcription 
factor 2, 57, 346, 348 

PBX3 — Pre-B cell leukemia transcription 
factor 3, 57, 346 

PCDH7 - Protocadherin, 183 

PCI - Protein C inhibitor, 192 

PCNA - Proliferating cell nuclear antigen, 111

PCSK1 — Proprotein convertase 1, 197

PCSK2 — Proprotein convertase 2, 197

PCSK4 — Proprotein convertase 4, 197

PCSK5 - Proprotein convertase 5, 197

PDGFRA - Platelet-derived growth factor
receptor-α, 58

PDGFRB - Platelet-derived growth factor
receptor-β, 16, 58

PDHA1 - Pyruvate dehydrogenase α1 subunit,
66, 108, 411, 412

PDHA2 - Pyruvate dehydrogenase α2 subunit,
411, 412

PEDF — Pigment epithelium-derived factor,
191

PEPB — Peptidase B, 60

PEPC - Peptidase C, 60 

PEPD - Peptidase D, 60 

PFKM - Phosphofructokinase, muscle, 61, 238 

PFKX — Phosphofructokinase, polypeptide X, 
61 

PGA3 — Pepsinogen A3, 61, 120, 331, 404 

PGA4 — Pepsinogen A4, 61, 120, 331, 404 

PGA5 - Pepsinogen A5, 61, 120, 331, 404

PGD - Phosphogluconate dehydrogenase, 15

PGGT1B - Protein geranylgeranyltransferase
type I, β, 275

PGK1 - Phosphoglycerate kinase 1, 406, 411,
412

PGK2 - Phosphoglycerate kinase 2, 411, 412

PGM3 - Phosphoglucomutase 3, 390

PGR — Progesterone receptor, 60, 189

PGY1 - P-glycoprotein 1, 358, 398, 408

PGY3 — P-glycoprotein 3, 358, 408 

PHB - Prohibitin, 17 

PI — Antitrypsin, α1- (α1 proteinase inhibitor),
20, 192 

PI10 — Serpin, ovalbumin-like, 192 

PI12 — Neuroserpin, 192 

PI2 — Serpin, ovalbumin-like, 192 

PI3 — Antileukoproteinase (elafin), 403 

PI4 — Kallistatin, 192 

PI5 — Maspin, 192

PI6 — Serpin, ovalbumin-like, 192 

PI7 — Nexin, 192 

PI8 — Cytoplasmic antiproteinase 2, 192 

PI9 — Serpin, ovalbumin-like, 192 

PIGF - Phosphatidylinositol glycan class F,
273

PIGR - Polymeric immunoglobulin receptor,
60

PIM1 — Pim-1 proto-oncogene, 348

PIN1 — Dodo, homolog of Drosophila gene, 144

PKD1 - Polycystin, 277, 350

PKLR - Pyruvate kinase, liver/red blood cell, 
60 

PKM2 - Pyruvate kinase, muscle, 60 

PLA2G1B - Phospholipase A2, group IB, 60

PLA2G2A — Phospholipase A2, group IIA, 60

PLAT - Plasminogen activator, tissue-type,
126, 191, 229

PLAU - Urokinase, 127, 191

PLA2G2C - Phospholipase A2, group IIC, 60

PLG — Plasminogen, 191, 197, 318, 357

PLI - Antiplasmin, α2-, 192

PLP - Myelin proteolipid protein, 300 

Pit — Placentally expressed gene, 244 

PLTP — Phospholipid transfer protein, 346 

PMCH - Melanin-concentrating hormone, 351 

PMCHL1 - Melanin-concentrating hormone-
like 1, 351 

PMCHL2 - Melanin-concentrating hormone- 
like 2, 351 

PMM2 — Phosphomannomutase, 277 

PMS1 — Post-meiotic segregation increased 
gene 1 (homolog of yeast gene), 147 

PMS2 — Post-meiotic segregation increased 
gene 2 (homolog of yeast gene), 9, 147 


PNLIP - Lipase, pancreatic, 122, 144 

POLA - DNA polymerase alpha, 66 

POU2F1 - POU domain transcription factor
2F1, 9, 61 

POU2F2 - POU domain transcription factor 
2F2, 61, 254 

POU3F2 - POU domain transcription factor 
3F2, 101 

PPARA - Peroxisome proliferator-activated
receptor α, 189

PPARD - Peroxisome proliferator-activated
receptor δ, 189

PPARG - Peroxisome proliferator-activated
receptor γ, 21, 189

PPAT - Phosphoribosylpyrophosphate 
amidotransferase, 233 

PPY - Pancreatic polypeptide/pancreatic 
icosapeptide, 25 

PRB1 - Proline-rich protein, BstNI subfamily
1, 351, 354, 404, 408 

PRB2 - Proline-rich protein, BstNI subfamily 
2, 351, 354, 404, 408 

PRB3 - Proline-rich protein, BstNI subfamily 
3, 351, 354, 404, 408 

PRB4 - Proline-rich protein, BstNI subfamily 
4, 351, 354, 404, 408 

PRH1 - Proline-rich protein, HaelllI subfamily 
1, 354, 357 

PRH2 - Proline-rich protein, HaellI subfamily 
2, 354 

PRKACA - Protein kinase, cAMP-dependent,
catalytic alpha, 412

PRKACG - cAMP-dependent protein kinase
Cγ subunit, 412

PRKCI1 - Protein kinase, class IV, ι, 190

PRKCA - Protein kinase, class I, α, 190

PRKCB1 - Protein kinase, class I, β, 190

PRKCD - Protein kinase, class II, δ, 190

PRKCE - Protein kinase, class III, ε, 190

PRKCG - Protein kinase, class I, γ, 190

PRKCQ - Protein kinase, class II, θ, 190

PRKCZ - Protein kinase, class IV, ζ, 190

PRKDC - DNA-activated protein kinase,
catalytic subunit, 233

PRL - Prolactin, 230 

PRLR - Prolactin receptor, 346 

PRM1 - Protamine 1, 247, 300, 342, 348

PRM2 - Protamine 2, 247, 300, 342, 348

PROC - Protein C, 126, 191, 427

PROS1 - Protein S, 191, 267

PRSS1 — Trypsin 1, 191, 196, 197 

PRSS2 — Trypsin 2, 191, 196, 197 

PRY - Tyrosine phosphatase PTP-BL related, 
Y-linked, 79 

PSAP — Prosaposin, 109, 404 

PSG1 — Pregnancy-specific glycoprotein 1, 59, 
226, 247, 300, 347, 348 

PSG11 — Pregnancy-specific glycoprotein 11, 
59, 226, 247, 347, 348 

PSG12 — Pregnancy-specific glycoprotein 12, 
59, 226, 247, 347, 348 

PSG13 — Pregnancy-specific glycoprotein 13, 
59, 226, 247, 347, 348 

PSG2 - Pregnancy-specific glycoprotein 2, 59,
226, 247, 347, 348 

PSG3 — Pregnancy-specific glycoprotein 3, 59, 
226, 247, 347, 348 

PSG4 — Pregnancy-specific glycoprotein 4, 59, 
226, 247, 347, 348 

PSG5 - Pregnancy-specific glycoprotein 5, 59,
226, 247, 347, 348 

PSG6 — Pregnancy-specific glycoprotein 6, 59, 
226, 247, 347, 348 

PSG7 — Pregnancy-specific glycoprotein 7, 59, 
226, 247, 347, 348 

PSG8 — Pregnancy-specific glycoprotein 8, 59, 
226, 247, 347, 348 

PSMA2 - Proteasome subunit α2, 348

PTEN - Phosphatase and tensin homolog
(mutated in multiple advanced cancers),
145

PTGER4 - Prostaglandin EP4 receptor, 268

PTH - Parathyroid hormone, 59, 239, 241 

PTHLH - Parathyroid hormone-like hormone, 
59 

PTMA - Prothymosin, 271 

PTN — Pleiotrophin, 239, 243, 244 

PTPRF - Protein tyrosine kinase, receptor 
type, f polypeptide, 59 

PVALB - Parvalbumin, 108 

PZP - Pregnancy zone protein, 59 

QDPR - Quinoid dihydropteridine reductase, 
58 

QM - Jun-associated transcription factor, 145 

QSCN6 - Quiescin Q6, 399

RAB3A - Ras proto-oncogene family member, 
Rab3A, 59 

RAB3B - Ras proto-oncogene family member, 
Rab3B, 59, 62 

RAB4 - Ras proto-oncogene family member, 
Rab4, 59, 62 

RABGGTA - Rab geranylgeranyl transferase α
subunit, 10

RAD23A - Recombination protein RAD23A, 
homolog of yeast gene, 145 

RAD23B - Recombination protein RAD23B, 
homolog of yeast gene, 145 

RAD51A - DNA repair protein, homolog of E.
coli gene, 147

RAD52 - Nucleotide excision repair protein
RAD52, 272

RAG1 - Recombination-activating protein 1,
107, 406, 407

RAG2 - Recombination-activating protein 2,
107, 406, 407 

RAP1A - Ras proto-oncogene family member,
Rap1A, 59

RAP1B - Ras proto-oncogene family member,
Rap1B, 59, 62

RARA - Retinoic acid receptor α, 189, 190

RARB - Retinoic acid receptor β, 189, 190

RARG - Retinoic acid receptor γ, 60, 189, 190

RARS — Aminoacyl-tRNA synthetase (arginyl), 
178 

RB1 - Retinoblastoma, 29

RBM1 - RNA-binding motif protein 1, 79, 80 


RCP — Red cone pigment, 14, 20, 110, 318, 330, 
351, 401, 402, 408 

RECA - Recombination protein, homolog of 
yeast Rad51 gene, 145, 148 

RECQL - DNA helicase, RecQ-like, 148 

RECQL5 - DNA helicase, RecQ protein-like 5,
148

RECQL4 - DNA helicase, RecQ protein-like 4,
148

REL — Rel proto-oncogene, 144, 343 

REN - Renin, 61, 120 

RENT1 - Regulator of nonsense transcripts 1,
115

RET — Ret proto-oncogene, 145 

RHAG - Rhesus blood group-associated 
antigen, 300 

RHCE - Rhesus blood group, CcEe antigens, 
141, 408 

RHD - Rhesus blood group, D antigen, 141, 
300, 331, 408 

RHO - Rhodopsin, 316 

RLN1 - Relaxin 1, 186

RLN2 - Relaxin 2, 186

RMSA1 - Regulator of mitotic spindle
assembly 1, 342

RN5S1 - 5S ribosomal RNA, 182

RN7SK - 7SK RNA, nuclear, 238

RN7SL - 7SL RNA, cytoplasmic, 32, 340

RNASE2 - Eosinophil-derived neurotoxin, 141 

RNASE3 - Eosinophil cationic protein, 141,
300

RNE1 - Small nucleolar mRNA 1, 9

RNE2 - Small nucleolar mRNA 2, 9

RNR1 - Ribosomal RNA cluster 1, 181, 360,
410 

RNR2 - Ribosomal RNA cluster 2, 181, 360, 
410 

RNR3 — Ribosomal RNA cluster 3, 181, 360, 
410 

RNR4 — Ribosomal RNA cluster 4, 181, 360, 
410 

RNR5 - Ribosomal RNA cluster 5, 181, 360,
410

RNU1 - U1 small nuclear RNA, 62, 182

RNU2 - U2 small nuclear RNA, 182, 410 

RPL10 — Ribosomal protein L10, 248 

RPL11 — Ribosomal protein L11, 248 

RPL13 — Ribosomal protein L13, 248 

RPL15 - Ribosomal protein L15, 248

RPL17 - Ribosomal protein L17, 248

RPL19 — Ribosomal protein L19, 248 

RPL22 — Ribosomal protein L22, 248 

RPL23A - Ribosomal protein L23A, 248, 271 

RPL24 — Ribosomal protein L24, 248 

RPL27 — Ribosomal protein L27, 248 

RPL27A — Ribosomal protein L27A, 248 

RPL28 — Ribosomal protein L28, 248 

RPL29 — Ribosomal protein L29, 248 

RPL3 — Ribosomal protein L3, 248 

RPL30 — Ribosomal protein L30, 248 

RPL31 — Ribosomal protein L31, 248 

RPL32 — Ribosomal protein L32, 248 

RPL35A - Ribosomal protein L35A, 248 

RPL36A - Ribosomal protein L36A, 248

RPL37 - Ribosomal protein L37, 248

RPL38 — Ribosomal protein L38, 248 

RPL4 — Ribosomal protein L4, 248 

RPL41 — Ribosomal protein L41, 248 

RPL5 - Ribosomal protein L5, 248

RPL6 — Ribosomal protein L6, 248 

RPL7 — Ribosomal protein L7, 248 

RPL8 — Ribosomal protein L8, 248 

RPL9 — Ribosomal protein L9, 248 

RPS10 - Ribosomal protein S10, 248

RPS11 - Ribosomal protein S11, 248

RPS12 - Ribosomal protein S12, 248

RPS13 - Ribosomal protein S13, 248

RPS14 - Ribosomal protein S14, 248

RPS15A - Ribosomal protein S15A, 248

RPS17 - Ribosomal protein S17, 248

RPS18 - Ribosomal protein S18, 248

RPS2 - Ribosomal protein S2, 248

RPS24 - Ribosomal protein S24, 248

RPS25 - Ribosomal protein S25, 248

RPS3 — Ribosomal protein S3, 248 

RPS3A — Ribosomal protein S3A, 248 

RPS4X — Ribosomal protein S4, X-linked, 78 

RPS4X/Y — Ribosomal protein S4, 248 

RPS4Y — Ribosomal protein S4, Y-linked, 78, 
79 

RPS5 — Ribosomal protein S5, 248 

RPS6 — Ribosomal protein S6, 248 

RPS7 — Ribosomal protein S7, 248 

RPS8 — Ribosomal protein S8, 248 

RPS9 — Ribosomal protein S9, 248 

RRAS - Ras proto-oncogene homolog, 59

RXRA - Retinoid X receptor α, 57, 189, 346

RXRB - Retinoid X receptor β, 57, 189, 346,
348

RXRG - Retinoid X receptor γ, 57, 189

RYR1 - Ryanodine receptor 1, 59

RYR2 - Ryanodine receptor 2, 59 

SAA — Serum amyloid A, 267 

SAA1 - Serum amyloid A1, 391

SAA2 - Serum amyloid A2, 391

SCA1 - Spinocerebellar ataxia type 1 (ataxin 1),
362, 364 

SCA2 - Spinocerebellar ataxia type 2 (ataxin 2), 
362 

SCA7 — Spinocerebellar ataxia type 7, 362 

SCCA1 - Squamous cell carcinoma antigen 1,
192

SCCA2 - Squamous cell carcinoma antigen 2,
192

SCN10A - Sodium channel 10, α-subunit, 347

SCN1A - Sodium channel 1, α-subunit, 347

SCN2A - Sodium channel 2, α-subunit, 347

SCN3A - Sodium channel 3, α-subunit, 347

SCN4A - Sodium channel 4, α-subunit, 23, 347

SCN5A - Sodium channel 5, α-subunit, 23, 347

SCN6A - Sodium channel 6, α-subunit, 347

SCN7A - Sodium channel 7, α-subunit, 347

SCN8A - Sodium channel 8, α-subunit, 23, 347

SCN9A - Sodium channel 9, α-subunit, 347

SCYA18 — Small inducible cytokine family A, 
member 18, 400 


SCYA3 — Small inducible cytokine A3, 400 

SCYA3L1 — Small inducible cytokine A3-like 
1, 400 

SCYA3L2 — Small inducible cytokine A3-like 
2, 400 

SDC1 - Syndecan 1, 56

SDC2 — Syndecan 2, 56 

SDC3 — Syndecan 3, 56 

SDC4 — Syndecan 4, 56 

SDF1 - Stromal cell-derived factor 1, 87

SDH1 - Succinate dehydrogenase, iron-sulfur
protein subunit, 144 

SEA - Avian erythroblastosis (S13) proto- 
oncogene homolog, 59 

SEC13L1 — Secretory pathway protein, 
homolog of yeast gene, 145 

SELE - Selectin, E-, 225 

SELP - Selectin, P-, 229 

SEMG1 - Semenogelin 1, 110, 330, 354

SEMG2 — Semenogelin 2, 110, 330, 354, 403 

SFRS2 — Arg/Ser-rich splicing factor, 176 

SFTPA1 - Pulmonary surfactant protein A, 
142, 348 

SFTPD - Pulmonary surfactant protein D, 348, 
350 

SHC1 - SHC transforming protein, 272 

SHH - Sonic hedgehog, 251 

SHMT1 - Serine hydroxymethyltransferase,
cytosolic, 273, 275 

SHOX — Short stature homeobox, 81 

SIAT4C - Sialyltransferase 4C, 61 

SIR2L - SIR2 (Silent mating type information
regulation 2), homolog of yeast gene, 147

SLC2A1 - Solute carrier family 2, member 1,
59

SLC2A3 - Solute carrier family 2, member 3,
59

SLC2A5 - Solute carrier family 2, member 5,
59

SLC6A10 — Solute carrier family 6, member 10, 
346, 395 

SLC6A4 — Solute carrier family 6, member 4, 
237 


SLC6A8 — Solute carrier family 6, member 8, 
346, 394, 395 

SLC9A3 — Solute carrier family 9, isoform A3, 
272 

SLO — Slowpoke, homolog of Drosophila gene, 
144 

SMA - Spinal muscular atrophy, 346 

SMCX - Selected cDNA on X, mouse, 
homolog of, 303 

SMCY - Selected cDNA on Y, mouse, homolog 
of, 79, 303 

SMN - Survival motor neuron protein, 346 

SNRP70 — Small nuclear ribonucleoprotein, 
70KD polypeptide, 61, 176 

SNRPA - Small nuclear ribonucleoprotein, 
polypeptide A, 61 

SNRPE - Small nuclear ribonucleoprotein, 
polypeptide F, 61

SNRPN - Small nuclear ribonucleoprotein 
polypeptide N, 30, 61, 274 

SOD1 - Superoxide dismutase 1, 15

SOD2 — Superoxide dismutase 2, 15 

SOD3 — Superoxide dismutase 3, 15 

SORD - Sorbitol dehydrogenase, 59 

SOS1 — Son of sevenless 1, homolog of 
Drosophila gene, 144 

SOS2 — Son of sevenless 2, homolog of 
Drosophila gene, 144 

SOX1 — SRY-related HMG box 1, 250 

SOX10 — SRY-related HMG box 10, 250 

SOX11 — SRY-related HMG box 11, 250 

SOX2 — SRY-related HMG box 2, 250 

SOX20 — SRY-related HMG box 20, 250 

SOX22 — SRY-related HMG box 22, 250 

SOX3 — SRY-related HMG box 3, 79, 250 

SOX4 — SRY-related HMG box 4, 250 

SOX5 - SRY-related HMG box 5, 250

SOX9 — SRY-related HMG box 9, 250 

SPARC — Osteonectin, 58, 230 

SPRR1A - Small proline-rich protein 1A, 16,
226, 355, 372, 373

SPRR1B - Small proline-rich protein 1B, 16,
226, 355, 372, 373 

SPRR2A - Small proline-rich protein 2A, 16, 
226, 227, 355, 356, 372, 373 

SPRR2C - Small proline-rich protein 2C, 226 

SPRR3 — Small proline-rich protein 3, 16, 226, 
227, 355, 372, 373 

SPTA1 - Spectrin, α-, 356, 397

SPTB - Spectrin, β-, 356

SRC - Tyrosine kinase src, 56

SRP14 — Alu RNA-binding protein, 363 

Srp46 — SR splicing factor SRp46, 412 

SRY — Sex determining region Y, 78, 79, 80, 
107, 121, 223, 224, 300 

SSR4 - Signal sequence receptor δ, 233

STATH - Statherin, 58 

STK2 — Protein serine/threonine kinase stk2, 
342 

STS — Steroid sulfatase, 80, 390, 394 

SURF1 - Surfeit 1, 66

SURF2 — Surfeit 2, 66, 234 

SURF3 — Surfeit 3, 66 

SURF4 — Surfeit 4, 66, 234 

SURF5 - Surfeit 5, 66, 122

SYBL1 - Synaptobrevin-like 1, 81

SYN1 - Synapsin I, 78

SYT1 - Synaptotagmin 1, 319

SYT2 - Synaptotagmin 2, 319

SYT3 - Synaptotagmin 3, 319

SYT4 - Synaptotagmin 4, 319

SYT5 - Synaptotagmin 5, 319

T - Brachyury, 249

TALDO1 - Transaldolase, 34, 337

TAP1 - Transporter, ATP-binding cassette, 1,
231 

TAP2 — Transporter, ATP-binding cassette, 2, 5 

TARS — Aminoacyl-tRNA synthetase 
(threonyl), 178 

TB4Y - Thymosin β4, 79

TBP — TATA box-binding protein, 348, 360 

TCEA1 - Transcriptional elongation factor
TFIIS, 273

TCF1 - Hepatocyte nuclear factor 1α, 122, 250

TCF3 — Transcription factor 3, 59 

TCP10 — t complex responder, 113 

TCRA - T-cell antigen receptor, α-subunit, 16,
67, 110, 196, 335, 346

TCRB - T-cell antigen receptor, β-subunit, 5,
16, 65, 67, 196, 331, 332, 333, 346, 391

TCRD - T-cell antigen receptor, δ-subunit, 16,
110, 196, 346, 410

TCRE - T-cell antigen receptor, ε-subunit, 16

TCRG - T-cell antigen receptor, γ-subunit, 16,
196, 283, 331, 346

TF — Transferrin, 16, 21, 352, 353 

TFCOUP1 - COUP transcription factor 1, 189

TFCOUP2 - COUP transcription factor 2, 189 

TFRC — Transferrin receptor, 16 

TG — Thyroglobulin, 398 

TGFB1 - Transforming growth factor β1, 21,
61, 347

TGFB2 - Transforming growth factor β2, 61,
347

TGFB3 - Transforming growth factor β3, 347

TGM1 - Transglutaminase 1, 10

TH — Tyrosine hydroxylase, 57, 60, 246, 392, 
393, 394 

THBD - Thrombomodulin, 8, 107 

THBS1 — Thrombospondin 1, 60, 347 

THBS2 — Thrombospondin 2, 347 

THH - Trichohyalin, 16 

THRA - Thyroid hormone receptor α (ear-7),
9, 189, 190

THRAL - Thyroid hormone receptor α-like
(ear-1), 9, 189

THRB - Thyroid hormone receptor β, 189, 190

THY1 - Thy-1 glycoprotein, 60, 124

TIMP1 - Tissue inhibitor of metalloproteinase
1, 66, 78, 110

TK1 - Thymidine kinase 1, 15, 230

TK2 — Thymidine kinase 2, 15 

TKT — Transketolase-related gene, 130 

TLE1 - Enhancer of split, homolog of
Drosophila gene, 144 

TM7SF2 — Transmembrane 7 superfamily, 
member 2, 399 

TNF — Tumor necrosis factor, 140, 171, 236 

TNFR1 - Tumor necrosis factor receptor 1, 59

TNFR2 - Tumor necrosis factor receptor 2, 59

TNNI1 - Troponin I, 1, 61, 114, 116, 143

TNNI2 - Troponin I, 2, 114, 143

TNNI3 - Troponin I, 3, 114, 143

TNNT1 - Troponin T1, skeletal slow, 61

TNP2 — Transition protein 2, 342 

TNR — Tenascin R, 57 

TNXA — Tenascin XA, 9, 57, 348 

TOP1 — Topoisomerase I, 270 

TP53 — Tumor protein p53, 29 

TPH - Tryptophan hydroxylase, 57, 60, 392, 
393, 394 

TPI1 - Triose phosphate isomerase, 15, 59, 118

TPM1 - Tropomyosin, α-, 112

TPM2 - Tropomyosin, β-, 321

TPO - Thyroid peroxidase, 122

TRA1 - Tumor rejection antigen 1, 60

TRAN - tRNA (alanine), 177

TRE - tRNA (glutamic acid), 62, 177

TREL1 - tRNA (glutamic acid-like 1), 62

TRG1 - tRNA (glycine), 177

TRK1 - tRNA (lysine), 177

TRL1 - tRNA (leucine 1), 177

TRL2 - tRNA (leucine 2), 177

TRMI1 - tRNA (methionine 1), 177

TRMI2 - tRNA (methionine 2), 177

TRN - tRNA (asparagine), 62, 177, 393

TRNL - tRNA (asparagine-like), 62, 393

TRP1 - tRNA (proline 1), 177

TRP2 - tRNA (proline 2), 177

TRP3 - tRNA (proline 3), 177

TRPC1 - Transient receptor potential channel-related
protein 1, 144

TRQ1 - tRNA (glutamine 1), 177

TRR1 - tRNA (arginine 1), 177

TRR3 - tRNA (arginine 3), 177

TRR4 - tRNA (arginine 4), 177

TRSP - tRNA (opal suppressor
phosphoserine), 178

TRT1 - tRNA (threonine 1), 177

TRT2 - tRNA (threonine 2), 177

TRV2 — tRNA (valine 2), 60 

TRV3 — tRNA (valine 3), 60 

TSHB - Thyroid stimulating hormone beta, 60 

TSHR - Thyrotropin receptor, 184 

TSPY — Testis-specific protein, Y-linked, 79 

TTN - Titin, 9 

TTR - Transthyretin, 29, 142 

TTY1 — Testis transcript Y1, 79 

TTY2 — Testis transcript Y2, 79 

TUBA1 - Tubulin, α1-, 118

TUBA2 - Tubulin, α2-, 118

TUBB - Tubulin, β-, 118, 245, 271

TUFM - Translation elongation factor Tu, 148, 
272 

TXN — Thioredoxin, 399 

TYK2 - Tyrosine kinase 2, 56, 60 

TYMS - Thymidylate synthase, 110 

TYR — Tyrosinase, 266 

U22 — U22 host gene, 111 

UBA52 - Ubiquitin A52, 26, 179, 401, 410

UBA80 - Ubiquitin A80, 179, 401 

UBB - Ubiquitin B, 26, 179, 410 

UBC - Ubiquitin C, 26, 179, 410 

UBE1 - Ubiquitin-activating enzyme, 78, 79

UBTF - Upstream binding transcription factor, 
250 

UGB - Uteroglobin, 228 

UNG - Uracil-DNA glycosylase, 272 

Uog1 — Gene of unknown function, 25 

UOX — Urate oxidase, 280 

UTY - Ubiquitous TPR motif Y, 79 

VARS1 - Aminoacyl-tRNA synthetase (valyl
1), 178, 346 

VARS2 - Aminoacyl-tRNA synthetase (valyl 
2), 346 

VDR - Vitamin D receptor, 60, 189 

VHL - von Hippel-Lindau syndrome, 21 

VIL1 - Villin 1, 109

VIL2 — Villin 2, 109 


VIM — Vimentin, 121 

VIP — Vasoactive intestinal 
polypeptide/PHM27, 25 

VWF - von Willebrand factor, 267, 277 

WARS — Aminoacyl-tRNA synthetase 
(tryptophanyl), 178, 393

WNT1 - Wingless 1, homolog of Drosophila
gene, 249

WRN - DNA helicase (RecQ), Werner
syndrome-associated, 148

WT1 - Wilms’ tumor protein, 30, 60, 113, 145,
239, 240, 247 

XG — Xg blood group, 78, 80, 266 

XIST — X (inactive)-specific transcript, 9, 30, 
83 

XKRY — XK-related, Y-linked, 79 

XRCCS5 — XRCC5 DNA repair protein, 229 

YB1 — Y box family member 1, 147 

YES1 — Yes proto-oncogene, 56 

ZFX — Zinc finger, X-linked, 78, 84, 303, 412 

ZFY — Zinc finger, Y-linked, 79, 303, 412 

ZNF11A - Zinc finger, 11A, 193, 389

ZNF11B - Zinc finger, 11B, 193, 390

ZNF123 - Zinc finger, 123, 193

ZNF125 - Zinc finger, 125, 193

ZNF128 — Zinc finger, 128, 193 

ZNF129 — Zinc finger, 129, 193 

ZNF132 — Zinc finger, 132, 193 

ZNF134 — Zinc finger, 134, 193 

ZNF135 — Zinc finger, 135, 193 

ZNF136 — Zinc finger, 136, 193 

ZNF137 - Zinc finger, 137, 193

ZNF145 — Zinc finger, 145, 193 

ZNF146 — Zinc finger, 146, 193 

ZNF154 — Zinc finger, 154, 193 

ZNF155 - Zinc finger, 155, 193

ZNF16 — Zinc finger, 16, 193 

ZNF160 — Zinc finger, 160, 193 

ZNF165 - Zinc finger, 165, 193

ZNF166 — Zinc finger, 166, 193 

ZNF167 — Zinc finger, 167, 193 

ZNF168 — Zinc finger, 168, 193 

ZNF173 — Zinc finger, 173, 193 

ZNF175 — Zinc finger, 175, 193 

ZNF204 — Zinc finger, 204, 193 

ZNF208 — Zinc finger, 208, 193, 396 

ZNF22 — Zinc finger, 22, 193 

ZNF25 — Zinc finger, 25, 193, 389 

ZNF33A - Zinc finger, 33A, 193 

ZNF33B — Zinc finger, 33B, 193, 390 

ZNF34 — Zinc finger, 34, 193 

ZNF35 — Zinc finger, 35, 193 

ZNF37A - Zinc finger, 37A, 193, 389 

ZNF37B — Zinc finger, 37B, 193, 390 

ZNF42 — Zinc finger, 42, 193 

ZNF43 — Zinc finger, 43, 193 

ZNF45 — Zinc finger, 45, 193 

ZNF52 — Zinc finger, 52, 193 

ZNF56 — Zinc finger, 56, 193 

ZNF58 — Zinc finger, 58, 193 

ZNF64 — Zinc finger, 64, 193 

ZNF66 — Zinc finger, 66, 193 

ZNF67 — Zinc finger, 67, 193 

ZNF69 - Zinc finger, 69, 193

ZNF7 - Zinc finger, 7, 193

ZNF70 - Zinc finger, 70, 193

ZNF71 - Zinc finger, 71, 193

ZNF74 - Zinc finger, 74, 193

ZNF75 - Zinc finger, 75, 272

ZNF76 - Zinc finger, 76, 193

ZNF80 - Zinc finger, 80, 239, 243, 358

ZNF83 - Zinc finger, 83, 193

ZNF85 - Zinc finger, 85, 193

ZNF90 - Zinc finger, 90, 193

ZNF91 - Zinc finger, 91, 141, 193, 251, 342

ZNF92 - Zinc finger, 92, 193

ZNF93 - Zinc finger, 93, 193


Gene symbols are as recommended by the Human Gene Nomenclature Committee
(http://www.gene.ucl.ac.uk/nomenclature/) at the time of going to press. Genes 
described by symbols in lower case letters have not yet been allocated a permanent 
symbol. 


Pseudogene symbols are not included. Symbols are sometimes abbreviated in the
text when a cluster of genes is being described, e.g. HOXA for the genes that
comprise the homeobox A cluster.


Index to Human Gene 
Symbols used in text 


(arranged alphabetically by gene) 


Abl proto-oncogene homolog 2 - ABL2, 59

Abl proto-oncogene-like - ABLL, 62

ABO blood group - ABO, 13, 87, 335, 402

Acetylcholinesterase - ACHE, 398

Acetylgalactosaminidase, α-N- - NAGA, 350

Acid glycoprotein 1, α1- - ORM1, 408

Acid glycoprotein 2, α1- - ORM2, 408

Acid maltase - GAA, 111 

Acid phosphatase 2, lysosomal — ACP2, 60 

Acid phosphatase 5, tartrate-resistant - ACP5, 
60 

Aconitase 1 - ACO1, 15

Aconitase 2 - ACO2, 15

Actin, α-, cardiac - ACTC, 15, 61, 151, 152, 223

Actin, γ-, cytoplasmic 2 - ACTG1, 110, 151,
152, 271

Actin, β-, non-muscle cytoplasmic - ACTB,
151, 152, 225, 271

Actin, α-, skeletal muscle - ACTA, 15, 61, 151,
152

Actin, γ-, smooth muscle enteric - ACTG2, 15,
151, 152

Actin, α-, smooth muscle, aorta - ACTA2, 15,
151, 152

Acyl-CoA binding protein — DBI, 273 

Acyl-coenzyme A dehydrogenase, medium 
chain -ACADM, 60 

Acyl-coenzyme A dehydrogenase, short chain — 
ACADS, 60 

Adenine nucleotide translocator 3 - ANT3, 80, 
394 

Adenosine deaminase — ADA, 18, 29 

Adenosylmethionine, S-, decarboxylase — 
AMD1, 273

Adenylate kinase 2 - AK2, 62

ADP-ribosyltransferase - ADPRT, 270

ADP-ribosyltransferase 1 (inactive) - ART2P,
282

Adrenergic receptor, α1B - ADRA1B, 58, 346

Adrenergic receptor, β2 - ADRB2, 58, 107, 346

Adrenergic receptor, α2C - ADRA2C, 58

Adrenocorticotropic hormone receptor — 
MC2R, 184 

Adrenoleukodystrophy protein - ALD, 145, 
268, 395, 396 

Adult skeletal muscle gene, imprinted 
(D11S878E) - H19, 9, 30

Alanyl aminopeptidase - ANPEP, 60


Albumin - ALB, 152, 153, 247, 348 

Albumin, α-/afamin - AFM, 152, 247, 348

Alcohol dehydrogenase 1 - ADH1, 16, 17, 141

Alcohol dehydrogenase 2 - ADH2, 16, 17, 141,
234

Alcohol dehydrogenase 3 - ADH3, 16, 17, 141

Alcohol dehydrogenase 4 - ADH4, 16

Alcohol dehydrogenase 5 - ADH5, 16

Alcohol dehydrogenase 7 - ADH7, 16 

Aldehyde dehydrogenase 1 - ALDH1, 118,
148

Aldehyde dehydrogenase 10 - ALDH10, 118,
148

Aldehyde dehydrogenase 2 - ALDH2, 88, 118

Aldehyde dehydrogenase 3 - ALDH3, 118, 148

Aldehyde dehydrogenase 5 - ALDH5, 118, 148

Aldehyde dehydrogenase 6 - ALDH6, 118, 148

Aldehyde dehydrogenase 9 - ALDH9, 118, 148

Aldolase A - ALDOA, 234, 235 

Alkaline phosphatase, intestinal — ALPI, 228, 
347 

Alkaline phosphatase, placental - ALPP, 228,
347 

Alkaline phosphatase, placental-like 2 — 
ALPPL2, 347 

Alu RNA-binding protein - SRP14, 363 

Amelogenin, X-linked - AMELX, 66, 78, 80,
334

Amelogenin, Y-linked - AMELY, 78, 79, 80, 334

Aminoacyl-tRNA synthetase (alanyl) - AARS,
178 

Aminoacyl-tRNA synthetase (arginyl) - RARS, 
178 

Aminoacyl-tRNA synthetase (asparaginyl) — 
NARS, 178 

Aminoacyl-tRNA synthetase (cysteinyl) — 
CARS, 178 

Aminoacyl-tRNA synthetase 
(glutaminyl/prolyl) - EPRS, 178, 393, 
398, 399 

Aminoacyl-tRNA synthetase (glycyl) - GARS, 
178 

Aminoacyl-tRNA synthetase (histidyl) -
HARS, 118, 119, 178, 393 

Aminoacyl-tRNA synthetase (isoleucyl) — 
IARS, 178 

Aminoacyl-tRNA synthetase (leucyl) - LARS, 
178 

Aminoacyl-tRNA synthetase (lysyl) - KARS,
178 

Aminoacyl-tRNA synthetase (methionyl) — 
MARS, 178 

Aminoacyl-tRNA synthetase (threonyl) — 
TARS, 178 

Aminoacyl-tRNA synthetase (tryptophanyl) — 
WARS, 178, 393 

Aminoacyl-tRNA synthetase (valyl 1) — 
VARS1, 178, 346

Aminoacyl-tRNA synthetase (valyl 2) — 
VARS2, 346 

Aminolevulinate synthase, δ-, 1 - ALAS1, 21

Aminolevulinate synthase, δ-, 2 - ALAS2, 66,
78, 110

Amphiregulin - AREG, 58 

Amylase, pancreatic 2A - AMY2A, 14, 242, 243, 
404, 408 

Amylase, pancreatic 2B - AMY2B, 14, 242, 243,
404, 408

Amylase, salivary 1A - AMY1A, 14, 239, 242,
243, 331, 390, 404, 408

Amylase, salivary 1B - AMY1B, 239, 242, 243,
331, 390, 404, 408

Amylase, salivary 1C - AMY1C, 34, 239, 242,
243, 338, 404, 408

Androgen receptor — AR, 66, 189, 362, 364, 
365 

Angiotensin I-converting enzyme — DCP1, 88, 
342, 352 

Angiotensin receptor - AGTR1, 184, 271, 274

Angiotensinogen - AGT, 61, 192

Annexin I - ANX1, 114, 225, 348

Annexin II - ANX2, 114, 348, 353

Annexin III - ANX3, 58, 348

Annexin IV - ANX4, 348

Annexin V - ANX5, 58, 348

Annexin VI - ANX6, 58, 114, 225, 348

Annexin VII - ANX7, 225, 348

Annexin VIII - ANX8, 348

Annexin XI - ANX11, 348

Annexin XIII - ANX13, 348

Anti-Müllerian hormone - AMH, 61

Antichymotrypsin, α1- - AACT, 192

Antileukoproteinase (elafin) - PI3, 403

Antiplasmin, α2- - PLI, 192

Antithrombin III - AT3, 60, 192, 236, 237

Antitrypsin, α1- (α1 proteinase inhibitor) - PI,
20, 192

Apolipoprotein A1 - APOA1, 60, 152, 154, 223,
224, 228, 229, 236

Apolipoprotein A2 - APOA2, 60, 152, 154

Apolipoprotein A4 - APOA4, 60, 152, 154

Apolipoprotein B - APOB, 8, 110, 116, 153,
225, 246

Apolipoprotein B mRNA-editing enzyme -
APOBEC1, 241, 246

Apolipoprotein C1 - APOC1, 60, 152, 154, 237,
238, 266, 300, 350

Apolipoprotein C2 - APOC2, 60, 152, 154, 237

Apolipoprotein C3 - APOC3, 60, 152, 154, 225

Apolipoprotein C4 - APOC4, 152, 153, 237, 238

Apolipoprotein D - APOD, 153 


Apolipoprotein E - APOE, 16, 28, 60, 152, 154, 
225, 237, 238 

Apolipoprotein E receptor 2 — LRP8, 130 

Apolipoprotein(a) — LPA, 141, 236, 239, 245, 
318, 357 

Archain, homolog of Drosophila gene - ARCN1,
144

Arg/Ser-rich splicing factor - SFRS2, 176

Arginase, liver - ARG1, 15, 231, 335

Argininosuccinate lyase - ASL, 15 

Argininosuccinate synthetase — ASS, 15, 17, 
271 

Arylamine N-acetyltransferase - AAC1, 107

Arylsulfatase A - ARSA, 176

Arylsulfatase B - ARSB, 176

Arylsulfatase D - ARSD, 80, 176

Arylsulfatase E - ARSE, 80, 176

Arylsulfatase F - ARSF, 176

Aspartylglucosaminidase - AGA, 58 

Ataxin 3 — MJD, 362, 364 

Atonal, homolog of Drosophila gene - ATOH1,
144

ATPase, Ca2+ transporting, cardiac muscle,
slow twitch 2 - ATP2A2, 60

ATPase, Ca2+ transporting, plasma membrane 1
- ATP2B1, 60

ATPase, Ca2+ transporting, plasma membrane 2
- ATP2B2, 60

ATPase, copper transporting, α-polypeptide -
ATP7A, 274

ATPase, Na+/K+ transporting, α polypeptide -
ATP1A3, 60

ATPase, Na+/K+ transporting, α polypeptide-like
2 - ATP1AL2, 60

ATPase, Na+/K+, α1 polypeptide - ATP1A1,
60, 62

ATPase, Na+/K+, β1 polypeptide - ATP1B1,
60, 62, 363

ATPase, Na+/K+, α2 polypeptide - ATP1A2,
60, 62

Avian erythroblastosis (S13) proto-oncogene 
homolog — SEA, 59 

Awd, homolog of Drosophila gene - NME1, 144 

Axl proto-oncogene - AXL, 124

Basic protein Y1 - BPY1, 79

Basic protein Y2 - BPY2, 79

Beaded filament structural protein 2, phakinin
- BFSP2, 113

Biliary glycoprotein - BGP, 343, 344

Blue cone pigment - BCP, 316, 334

Brachyury - T, 249

Brain-derived neurotrophic factor - BDNF, 198

Branched chain aminotransferase 1 - BCAT1,
59 

Branched chain aminotransferase 2 — BCAT2, 
59 

Breakpoint cluster region - BCR, 363 

Breast cancer susceptibility, early onset — 
BRCA1, 140, 238, 239, 240, 274

Bruton’s tyrosine kinase - BTK, 110, 111

C1 inhibitor - C1NH, 61

C4b-binding protein α-chain - C4BPA, 15, 62,
124 


C4b-binding protein β-chain - C4BPB, 15, 62,
280

Cadherin 1 - CDH1, 112, 183

Cadherin 11 - CDH11, 183 

Cadherin 12 - CDH12, 183, 275 

Cadherin 13 — CDH13, 183 

Cadherin 15 - CDH15, 183 

Cadherin 2 —- CDH2, 183, 363 

Cadherin 3 - CDH3, 183 

Cadherin 5 - CDH5, 183

Calbindin-D9K — CALB3, 229 

Calcitonin-related polypeptide beta - CALCB, 
59 

Calcitonin/calcitonin gene-related peptide α -
CALCA, 25, 59, 267

Calcium channel, voltage dependent, P/Q type,
α1A subunit - CACNA1A, 362

Calcium channel, voltage-dependent, L-type,
alpha 1C subunit - CACNA1C, 59

Calmodulin 1 - CALM1, 271

Calmodulin-like gene 1 - CALML1, 108, 338

Calmodulin-like gene 3 - CALML3, 108

Calpain 1, large subunit - CAPN1, 61

Calpain 2, large subunit - CAPN2, 61 

Calpain 3 - CAPN3, 61 

Calpain 4, small subunit - CAPN4, 61 

cAMP-dependent protein kinase Cγ subunit -
PRKACG, 412 

Capping protein Cap G - CAPG, 109 

Carbamoyl phosphate synthetase 1 — CPS1, 15, 
118, 148 

Carbonic anhydrase V - CA5, 267

Carcinoembryonic antigen — CEA, 59, 300, 347 

Carcinoembryonic antigen-like 1 - CEAL1, 59 

Cardiac myosin binding protein C - MYBPC3,
116

Cartilage matrix protein - MATN1, 23

Casein kinase 2-α1 subunit - CSNK2A1, 108

Casein, β- - CSN2, 124, 390

Catenin, αE- - CTNNA1, 272

Cathepsin D - CTSD, 61

Cathepsin E - CTSE, 61

Cathepsin H - CTSH, 61

CD1A antigen - CD1A, 60

CD1B antigen - CD1B, 60

CD1C antigen - CD1C, 60, 397

CD1D antigen - CD1D, 60

CD1E antigen - CD1E, 60

CD36 antigen - CD36, 347 

CD36 antigen-like protein 1 - CD36L1, 347 

CD36 antigen-like protein 2 - CD36L2, 347 

CD3D antigen — CD3D, 60 

CD3E antigen - CD3E, 60 

CD3G antigen — CD3G, 60 

CD3Z antigen, zeta polypeptide - CD3Z, 9, 60 

CD4 antigen - CD4, 20, 60, 85 

CD48 antigen — CD48, 60 

CD58 antigen — CD58, 62 

CD68 antigen — CD68, 231 

CD8A antigen — CD8A, 239, 240 

Cell division cycle 4-like - CDC4L, 244 

Cell fate determining gene, homolog of 
Caenorhabditis gene - CAGR1, 144


Cellular retinoic acid-binding protein — 
CRABP1, 60 

Centromeric protein CENP-B - CENPB, 338 

Ceruloplasmin - CP, 148

Chemokine receptor 2 - CCR2, 86

Chemokine receptor 5 - CCR5, 87

Chloride channel 1 - CLCN1, 114

Chloride channel 5 - CLCN5, 114

Chloride channel 6 - CLCN6, 114

Chloride channel 7 - CLCN7, 114

Cholesterol ester transfer protein - CETP, 346

Cholesterol-repressible protein 39B - CHR39B, 
61 

Cholinergic receptor, muscarinic 1 - CHRM1, 
59 

Cholinergic receptor, muscarinic 3 - CHRM3,
59

Cholinergic receptor, muscarinic 4 - CHRM4,
59

Cholinergic receptor, muscarinic 5 - CHRM5,
59

Chorionic gonadotropin, α chain - CGA, 210

Chorionic gonadotropin, β chain - CGB, 60,
227, 335, 336

Chorionic somatomammotropin 1 - CSH1, 141,
162, 164, 165, 227, 268, 278, 404, 408, 410

Chorionic somatomammotropin 2 - CSH2, 141,
162, 164, 165, 227, 268, 278, 404, 408, 410

Chorionic somatomammotropin-like - CSHL1, 
115, 278, 320, 410 

Chromodomain Y — CDY, 79 

Chymotrypsin - CTRB1, 191, 197 

Chymotrypsin-like protease - CTRL, 197 

Collagen, α1(I) - COL1A1, 110, 157

Collagen, α1(II) - COL2A1, 60, 108, 109, 157

Collagen, α1(III) - COL3A1, 156, 157

Collagen, α1(IV) - COL4A1, 156, 157, 231

Collagen, α1(IX) - COL9A1, 157

Collagen, α1(V) - COL5A1, 8, 108, 156, 157,
346

Collagen, α1(VI) - COL6A1, 157

Collagen, α1(VII) - COL7A1, 8, 157

Collagen, α1(VIII) - COL8A1, 157

Collagen, α1(X) - COL10A1, 110, 157

Collagen, α1(XI) - COL11A1, 60, 157

Collagen, α1(XII) - COL12A1, 157

Collagen, α1(XIII) - COL13A1, 157

Collagen, α1(XIV) - COL14A1, 157

Collagen, α1(XIX) - COL19A1, 157

Collagen, α1(XV) - COL15A1, 157

Collagen, α1(XVI) - COL16A1, 157

Collagen, α1(XVII) - COL17A1, 157

Collagen, α1(XVIII) - COL18A1, 157

Collagen, α2(I) - COL1A2, 157, 158

Collagen, α2(IV) - COL4A2, 20, 156, 157, 231

Collagen, α2(IX) - COL9A2, 157

Collagen, α2(V) - COL5A2, 156, 157

Collagen, α2(VI) - COL6A2, 157

Collagen, α2(VIII) - COL8A2, 60, 157

Collagen, α2(XI) - COL11A2, 156, 157, 346

Collagen, α3(IV) - COL4A3, 156, 157, 231

Collagen, α3(IX) - COL9A3, 157

Collagen, α3(VI) - COL6A3, 157

Collagen, α4(IV) - COL4A4, 156, 157, 231

Collagen, α5(IV) - COL4A5, 66, 156, 157, 231

Collagen, α6(IV) - COL4A6, 156, 157, 231

Colony stimulating factor 1 - CSF1, 58 

Colony-stimulating factor-1 receptor (Fms
proto-oncogene) - CSF1R, 16, 58, 270,
274

Complement component 1Q α-chain - C1QA,
15, 60, 62, 153

Complement component 1Q β-chain - C1QB,
15, 60, 62, 153

Complement component 1Q γ-chain - C1QG,
153

Complement component 6 — C6, 109, 124, 153 

Complement component 7 — C7, 58, 109, 153 

Complement component 9 — C9, 59, 109, 153 

Complement component C1R - C1R, 59, 153

Complement component C1S — C1S, 59, 153 

Complement component C2 — C2, 153, 171, 346 

Complement component C3 — C3, 59, 154 

Complement component C4A — C4A, 13, 153, 
171, 331, 346, 350, 351 

Complement component C4B — C4B, 13, 153, 
171, 331, 346, 350, 351 

Complement component C5 — C5, 153, 154, 
343, 346 

Complement component C8A - C8A, 60, 62,
153 

Complement component C8B — C8B, 60, 62, 
153 

Complement component C8G — C8G, 153 

Complement component receptor 1 - CR1, 62,
355, 357 

Complement component receptor 2 — CR2, 62, 
130 

Corticosteroid-binding globulin — CBG, 192 

COUP transcription factor 1 - TFCOUP1, 189 

COUP transcription factor 2 - TFCOUP2, 189

Creatine kinase, mitochondrial 1 - CKMT1, 61

Creatine kinase, muscle - CKM, 61

Crystallin, γA- - CRYGA, 155

Crystallin, αA- - CRYAA, 130, 155

Crystallin, βA1- - CRYBA1, 155

Crystallin, βA2- - CRYBA2, 155

Crystallin, βA4- - CRYBA4, 155

Crystallin, αB- - CRYAB, 155, 185

Crystallin, γB- - CRYGB, 155, 228

Crystallin, βB1- - CRYBB1, 155

Crystallin, βB2- - CRYBB2, 155

Crystallin, βB3- - CRYBB3, 155

Crystallin, γC- - CRYGC, 155, 228

Crystallin, γD- - CRYGD, 155

Crystallin, γS- - CRYGS, 155

Cut, homolog of Drosophila gene - CUTL, 144 

Cyclin-dependent kinase inhibitor (p15) -
CDKN2B, 408

Cyclin-dependent kinase inhibitor (p16) — 
CDKN2A, 28, 408 

Cystathionine-β-synthase - CBS, 148

Cystatin A - CSTA, 184

Cystatin B - CSTB, 184, 362

Cystatin C - CST3, 184

Cystatin S - CST4, 184

Cystatin SA - CST2, 184

Cystatin SN - CST1, 184

Cysteine and glycine-rich protein 2 - CSRP2,
272

Cystic fibrosis transmembrane conductance 
regulator — CFTR, 86, 123, 223, 235, 269, 
352 

Cytidine monophosphate-N-acetylneuraminic 
acid hydroxylase - CMAH, 283 

Cytochrome b-245 - CYBB, 66 

Cytochrome b5 - CYB5, 271

Cytochrome c - CYC1, 145

Cytochrome c oxidase subunit II - MTCO2, 
355 

Cytochrome c oxidase subunit IV - COX4, 84, 
302 

Cytochrome c oxidase subunit V b - COX5B,
223

Cytochrome c oxidase subunit VI b - COX6B, 
271 

Cytochrome c oxidase subunit X — COX10, 9 

Cytochrome P450, 11A - CYP11A, 183 

Cytochrome P450, 11B1 - CYP11B1, 183

Cytochrome P450, 11B2 - CYP11B2, 183 

Cytochrome P450, 17 - CYP17, 183, 184 

Cytochrome P450, 19- CYP19, 183 

Cytochrome P450, 21 - CYP21, 9, 18, 70, 183, 
184, 265, 266, 277, 350, 351, 411 

Cytochrome P450, 24 - CYP24, 183, 185 

Cytochrome P450, 26A1 - CYP26A1, 183 

Cytochrome P450, 27A1 — CYP27A1, 183 

Cytochrome P450, 27B1 —- CYP27B1, 183 

Cytochrome P450, 2A13 — CYP2A13, 183 

Cytochrome P450, 2A6 —- CYP2A6, 183, 408 

Cytochrome P450, 2A7 - CYP2A7, 183, 408 

Cytochrome P450, 2B6 - CYP2B6, 183 

Cytochrome P450, 2B7 - CYP2B7, 183 

Cytochrome P450, 2C10 - CYP2C10, 183 

Cytochrome P450, 2C18 - CYP2C18, 183 

Cytochrome P450, 2C19 - CYP2C19, 183 

Cytochrome P450, 2C8 - CYP2C8, 183 

Cytochrome P450, 2C9 - CYP2C9, 183 

Cytochrome P450, 2D6 - CYP2D6, 183, 335 

Cytochrome P450, 2E - CYP2E, 183 

Cytochrome P450, 2F1 - CYP2F1, 183 

Cytochrome P450, 2J2 - CYP2J2, 183

Cytochrome P450, 3A4 - CYP3A4, 183 

Cytochrome P450, 4A11 —- CYP4A11, 183 

Cytochrome P450, 4B1 - CYP4B1, 183 

Cytochrome P450, 51 - CYP51, 147, 183, 184 

Cytochrome P450, 7A1 - CYP7A1, 183 

Cytochrome P450, A1 - CYP1A1, 183

Cytochrome P450, A2 - CYP1A2, 183

Cytochrome P450, B1 - CYP1B1, 183

Cytokine receptor B4 —- CRFB4, 348 

Cytoplasmic antiproteinase 2 — PI8, 192 

DEAD/H box 1 - DDX1, 144 

DEAD/H box 10 - DDX10, 273 

DEAD/H box 11 - DDX11, 396 

DEAD/H box 12 - DDX12, 396 

DEAD/H box 3, Y-linked - DBY, 79 

Decay accelerating factor - DAF, 62, 153, 343

Defensin A1 - DEFA1, 129, 399

Defensin A3 - DEFA3, 399

Defensin A4 - DEFA4, 129, 399

Defensin A5 - DEFA5, 129, 399, 400

Defensin A6 —- DEFA6, 129, 399, 400 

Defensin B1 — DEFB1, 129, 399 

Defensin B2 —- DEFB2, 129 

Dehydrocholesterol, 7-, reductase - DHCR7, 
399 

Deleted in azoospermia — DAZ, 79 

Deleted in azoospermia-like, Y-linked — 
DAZLI, 80 

Deleted in colorectal carcinoma — DCC, 37, 403 

Dentatorubral-pallidoluysian atrophy — 
DRPLA, 362 

Desmin — DES, 121 

Dihydrofolate reductase - DHFR, 58, 270

Dihydrolipoamide succinyltransferase - DLST, 
273 

Dishevelled, homolog of Drosophila gene -
DVL1, 144

DNA helicase (RecQ), Bloom syndrome- 
associated — BLM, 148 

DNA helicase (RecQ), Werner syndrome- 
associated — WRN, 148 

DNA helicase, RecQ protein-like 4 - RECQL4,
148

DNA helicase, RecQ protein-like 5 - RECQL5,
148

DNA helicase, RecQ-like - RECQL, 148 

DNA polymerase alpha - POLA, 66 

DNA repair protein, homolog of E. coli gene -
RAD51A, 147

DNA-activated protein kinase, catalytic 
subunit — PRKDC, 233 

Dodo, homolog of Drosophila gene - PIN1, 144

Dopamine receptor 1 — DRD1, 58, 184 

Dopamine receptor 2 — DRD2, 184 

Dopamine receptor 3 - DRD3, 184 

Dopamine receptor 4 — DRD4, 184 

Dopamine receptor 5 - DRD5, 58, 184, 266

Dual specificity phosphatase 8 - DUSP8, 273 

Dynein, cytoplasmic - DNECL, 145 

Dystrophia myotonica protein kinase - DMPK, 
145, 363 

Dystrophin - DMD, 7, 8, 9, 13, 17, 66, 108, 110, 
222, 335, 338, 345, 362 

Ecotropic viral integration site 2A — EVI2A, 9, 
110 

Ecotropic viral integration site 2B — EVI2B, 9, 
110 

Elastase — ELA1, 59, 191, 197, 281 

Elk-1 proto-oncogene - ELK1, 272

Emerin — EMD, 390, 391 

Endothelin receptor A - EDNRA, 184 

Endothelin receptor B — EDNRB, 184 

Engrailed 1 — EN1, 249 

Engrailed 2 — EN2, 249 

Enhancer of split, homolog of Drosophila gene — 
TLE1, 144 

Enolase 1 —- ENO1, 59, 271 

Enolase 2 — ENO2, 15, 59 

Eosinophil cationic protein - RNASE3, 141, 
300 


Eosinophil-derived neurotoxin — RNASE2, 141 

Epidermal growth factor - EGF, 16, 58

Epidermal growth factor receptor - EGFR, 16, 
56, 144 

Epstein-Barr virus insertion site 1 - EBVS1, 59 

Epstein-Barr virus modification site 2 — 
EBVM1, 59 

Erb B proto-oncogene homolog 2 —- ERBB2, 56 

Erb B proto-oncogene homolog 3 - ERBB3, 56, 
60 

Erb B proto-oncogene homolog 4 —- ERBB4, 56 

Erythroid protein 4.1 - EPB41, 112 

Esterase A4 — ESA4, 59 

Estrogen receptor — ESR1, 189 

Estrogen-related receptor α - ESRRA, 273

Ets 1 proto-oncogene - ETS1, 144, 251

Ets 2 proto-oncogene - ETS2, 251

Eukaryotic translation initiation factor 2 -
EIF2S3, 79

Ewing sarcoma breakpoint region 1 - EWSR1,
108

Factor IX — F9, 36, 110, 126, 191, 239, 245, 308, 
309, 427 


Factor V — F5, 88, 148 

Factor VII — F7, 12, 16, 22, 23, 87, 126, 191, 236, 
427 

Factor VIII — F8C, 8, 9, 27, 29, 37, 66, 116, 148, 
339, 345, 391 

Factor VIII gene-associated transcript — F8A, 9, 
37, 391 

Factor X — F10, 16, 22, 23, 126, 191, 427 

Factor XI — F11, 58, 126, 191 

Factor XII — F12, 58, 127, 191 

Factor XIII subunit a - F13A, 15

Factor XIII subunit b — F13B, 15, 123 

Farnesyl diphosphate synthase-like 1 -
FDPSL1, 60

Fat facets-related, homolog of Drosophila gene — 
DFFRY, 79 

Fatty acid binding protein 3 - FABP3, 272 

Fbr-associated ubiquitously expressed gene — 
FAU, 270 

Ferritin H chain — FTH1, 21, 271 

Ferritin L chain — FTL, 271 

Ferritin, heavy polypeptide-like 1 - FTHL1, 62 

Ferritin, heavy polypeptide-like 2 - FTHL2, 62 

Fertilin, α- - FTNA, 282

Fes proto-oncogene - FES, 59

Fetoprotein, α- - AFP, 152, 153, 247, 348

Fgr proto-oncogene homolog - FGR, 56, 59,
62

Fibrinogen α-chain - FGA, 15, 109, 247, 348

Fibrinogen β-chain - FGB, 15, 109, 247, 348,
391

Fibrinogen γ-chain - FGG, 15, 109, 247, 348,
391

Fibroblast growth factor 1 - FGF1, 16, 58, 158,
159

Fibroblast growth factor 10 - FGF10, 158, 159 

Fibroblast growth factor 11 — FGF11, 158, 159 

Fibroblast growth factor 12 — FGF12, 158 

Fibroblast growth factor 13 —- FGF13, 158 

Fibroblast growth factor 14 - FGF14, 158

Fibroblast growth factor 2 - FGF2, 21, 58, 158,
159

Fibroblast growth factor 3 — FGF3, 59, 158, 390 

Fibroblast growth factor 4 — FGF4, 59, 158, 
159, 390 

Fibroblast growth factor 5 - FGF5, 58, 158

Fibroblast growth factor 6 — FGF6, 59, 158, 159 

Fibroblast growth factor 7 — FGF7, 158, 266 

Fibroblast growth factor 8 - FGF8, 158 

Fibroblast growth factor 9 - FGF9, 158 

Fibroblast growth factor receptor 1 - FGFR1,
24, 159

Fibroblast growth factor receptor 2 — FGFR2, 
24, 159 

Fibroblast growth factor receptor 3 - FGFR3, 
24, 58, 159 

Fibroblast growth factor receptor 4 - FGFR4, 
16, 58, 159 

Fibronectin — FN1, 113 

Filaggrin — FLG, 16, 26, 357, 372 

Filamin - FLN1, 390, 391

Finger on X and Y — FXY, 81 

Flavin-containing monooxygenase 2 - FMO2, 
283, 284 

Flightless, homolog of Drosophila gene — FLI1, 
144 

Follicle stimulating hormone β - FSHB, 60

Forkhead FKHL13 - FKHL13, 121

Formylpeptide receptor 1 — FPR1, 108, 347 

Formylpeptide receptor-like 1 - FPRL1, 347 

Formylpeptide receptor-like 2 — FPRL2, 347 

Fos proto-oncogene — FOS, 144 

Fragile X mental retardation syndrome type 1 — 
FMR1, 38, 360, 361, 362, 365, 366 

Fragile X mental retardation syndrome type 2 -
FMR2, 362

Frataxin - FRDA, 362, 363

Fucosyltransferase 1 - FUT1, 61, 141, 348

Fucosyltransferase 2 — FUT2, 12, 61, 141, 348 

Fucosyltransferase 3 - FUT3, 12, 142, 348 

Fucosyltransferase 4 — FUT4, 61, 142, 348 

Fucosyltransferase 5 - FUT5, 142, 348

Fucosyltransferase 6 - FUT6, 142, 348

Fucosyltransferase 7 - FUT7, 142, 348

Fucosyltransferase 8 - FUT8, 142, 348

Full length retroviral sequence 1 - FRV1, 60 

Full length retroviral sequence 3 - FRV3, 60 

Fyn proto-oncogene — FYN, 56 

G protein-coupled receptor kinase GRK6 — 
GPRK6, 273 

G-protein-coupled receptor 1 - GPR1, 185, 228 

G-protein-coupled receptor 10 - GPR10, 185

G-protein-coupled receptor 12 - GPR12, 185

G-protein-coupled receptor 13 - GPR13, 185

G-protein-coupled receptor 15 - GPR15, 185

G-protein-coupled receptor 18 - GPR18, 185

G-protein-coupled receptor 2 - GPR2, 185

G-protein-coupled receptor 20 - GPR20, 185

G-protein-coupled receptor 3 - GPR3, 185

G-protein-coupled receptor 4 - GPR4, 185

G-protein-coupled receptor 5 - GPR5, 185

G-protein-coupled receptor 6 - GPR6, 185 

G-protein-coupled receptor 7 - GPR7, 185 


G-protein-coupled receptor 8 - GPR8, 185 

G-protein-coupled receptor 9 - GPR9, 185 

G/T mismatch-binding protein - GTBP, 147

Galactokinase 1 - GALK1, 15

Galactokinase 2 - GALK2, 15

Galactose 6-sulfatase - GALNS, 176

Galactose-1-phosphate uridylyltransferase -
GALT, 15, 26, 402

Galactosidase A, α-D- - GLA, 116, 247, 350

Galactosyltransferase, α-1,3- - GGTA1, 143,
281, 332

Gamma-aminobutyric acid receptor δ -
GABRD, 159

Gamma-aminobutyric acid receptor ε -
GABRE, 159

Gamma-aminobutyric acid receptor α1 -
GABRA1, 58, 159, 392

Gamma-aminobutyric acid receptor β1 -
GABRB1, 58, 159, 392

Gamma-aminobutyric acid receptor γ1 -
GABRG1, 159, 392

Gamma-aminobutyric acid receptor ρ1 -
GABRR1, 159

Gamma-aminobutyric acid receptor α2 -
GABRA2, 58, 159, 392

Gamma-aminobutyric acid receptor β2 -
GABRB2, 159, 392

Gamma-aminobutyric acid receptor γ2 -
GABRG2, 159, 392

Gamma-aminobutyric acid receptor ρ2 -
GABRR2, 159

Gamma-aminobutyric acid receptor α3 -
GABRA3, 159, 391

Gamma-aminobutyric acid receptor β3 -
GABRB3, 159, 392

Gamma-aminobutyric acid receptor γ3 -
GABRG3, 159, 392

Gamma-aminobutyric acid receptor α4 -
GABRA4, 159

Gamma-aminobutyric acid receptor β4 -
GABRB4, 159

Gamma-aminobutyric acid receptor α5 -
GABRA5, 159, 352, 391, 392, 396

Gamma-aminobutyric acid receptor α6 -
GABRA6, 159

GATA-binding protein 1 - GATA1, 66, 78 

Gelsolin — GSN, 109 

Gene of unknown function — Uog1, 25 

Gli-kruppel family member, HKR1 - HKR1,
60

Gli-kruppel family member, HKR2 - HKR2,
60

Gli-kruppel family member, HKR3 - HKR3,
60

Glial fibrillary acidic protein - GFAP, 121

Glioma-associated proto-oncogene — GLI, 60, 
390 

Globin, β- - HBB, 7, 13, 15, 16, 20, 29, 67, 85,
86, 111, 162, 223, 244, 268, 408, 411

Globin, δ- - HBD, 7, 16, 162, 247, 268, 408

Globin, ε- - HBE1, 16, 162, 222, 223, 238, 241,
268

Globin, θ- - HBQ1, 161, 240, 268

Globin, ζ- - HBZ, 161, 234, 268, 275, 279, 331,
351, 404

Globin, α1- - HBA1, 15, 20, 161, 234, 247, 267,
268, 275, 351, 404, 408

Globin, α2- - HBA2, 15, 161, 268, 404, 408

Globin, γ1- - HBG1, 7, 16, 162, 222, 223, 230,
235, 268, 349, 404, 408, 411

Globin, γ2- - HBG2, 16, 162, 222, 230, 235, 268,
349, 408, 411

Glucagon — GCG, 124 

Glucocerebrosidase — GBA, 267, 277 

Glucocorticoid receptor - GRL, 58, 189

Glucosamine 6-sulfatase - GNS, 176 

Glucose dehydrogenase - GDH, 15, 59 

Glucose phosphate isomerase — GPI, 59, 148 

Glucose-dependent insulinotropic peptide -
GIP, 118

Glucosidase, α, neutral AB - GANAB, 61

Glucosidase, α, neutral C - GANC, 61

Glucuronidase, β- - GUSB, 267

Glutamate receptor subtype 5, metabotropic -
GRM5, 113

Glutamic-oxaloacetic transaminase 2-like -
GOT2L3, 59

Glutamic-oxaloacetic transaminase 2-like 1 -
GOT2L1, 59, 62

Glutamic-oxaloacetic transaminase 2-like 2 -
GOT2L2, 59, 62

Glutathione peroxidase type 5 - GPX5, 115

Glutathione S-transferase 2 —- GSTA2, 390 

Glutathione S-transferase A1 - GSTA1, 363

Glutathione S-transferase, mu-class 1 — 
GSTM1, 60, 330 

Glutathione S-transferase, mu-class 2 — 
GSTM2, 330 

Glutathione S-transferase, mu-class 4 — 
GSTM4, 330 

Glutathione S-transferase, mu-class 5 -
GSTM5, 330

Glutathione S-transferase, Pi class - GSTP1, 
60 

Glutathione S-transferase, theta-class 1 -
GSTT1, 330

Glyceraldehyde-3-phosphate dehydrogenase — 
GAPD, 15, 118, 271 

Glycinamide ribonucleotide synthetase — 
GART, 25 

Glycophorin A - GYPA, 164, 165, 166, 321, 
350, 402, 404, 405, 408 

Glycophorin B - GYPB, 130, 164, 165, 166, 
321, 335, 350, 404, 405 

Glycophorin E - GYPE, 130, 164, 165, 335, 
350, 404, 405, 408 

Glycoprotein, α1-B - A1BG, 60

Glycoprotein, α2-HS- - AHSG, 184

Granulocyte macrophage colony-stimulating
factor receptor α-chain - CSF2RA, 16, 78,
80, 81, 109

Granulocyte-macrophage colony-stimulating
factor - GMCSF, 16

Granzyme A - GZMA, 58

Green cone pigment - GCP, 14, 20, 110, 316,
330, 351, 401, 402, 408


Group-specific component/vitamin D-binding 
globulin — GC, 153, 247, 338 

Growth factor, yeast Erv1-homologous -
GFER, 399

Growth hormone 1, pituitary - GH1, 10, 16, 20,
110, 141, 162, 164, 165, 222, 227, 254, 268,
278, 302, 321, 322, 404, 410

Growth hormone 2, placental - GH2, 16, 141, 
162, 164, 165, 227, 268, 278, 321, 322, 404, 
410 

Growth hormone gene-derived transcriptional 
activator - GHDTA, 10 

Growth hormone receptor — GHR, 16, 346 

Growth/differentiation factor 1 - GDF1, 25

GTP-binding protein Gαq - GNAQ, 160, 272

Guanine nucleotide binding protein, alpha 11 — 
GNA11, 160 

Guanine nucleotide binding protein, alpha 12 — 
GNA12, 160 

Guanine nucleotide binding protein, alpha 13 — 
GNA13, 160 

Guanine nucleotide binding protein, alpha 14 -
GNA14, 160

Guanine nucleotide binding protein, alpha 15 -
GNA15, 160

Guanine nucleotide binding protein, alpha
activating activity polypeptide, olfactory
type - GNAL, 160

Guanine nucleotide binding protein, alpha
activating activity polypeptide O -
GNAO1, 160

Guanine nucleotide binding protein, alpha
inhibiting activity 2 - GNAI2, 160

Guanine nucleotide binding protein, alpha
inhibiting activity 3 - GNAI3, 59, 160

Guanine nucleotide binding protein, alpha
inhibiting activity polypeptide 1 -
GNAI1, 160

Guanine nucleotide binding protein, alpha
inhibiting activity polypeptide 2-like -
GNAI2L, 59

Guanine nucleotide binding protein, alpha
stimulating activity polypeptide 1 -
GNAS1, 160

Guanine nucleotide binding protein, alpha
transducing activity 1 - GNAT1, 160

Guanine nucleotide binding protein, alpha
transducing activity 2 - GNAT2, 59,
160

Guanine nucleotide binding protein, alpha Z — 
GNAZ, 160 

Guanine nucleotide binding protein, beta 
polypeptide 1 - GNB1, 59 

Guanine nucleotide binding protein, beta 
polypeptide 3 - GNB3, 59 

Guanylate kinase 1 - GUK1, 62 

Gulono-γ-lactone, L-, oxidase - GULOP, 282,
320

H factor - HF1, 153

Haptoglobin - HP, 141, 282, 329, 337, 351, 404,
408

Haptoglobin-related protein - HPR, 141, 282,
404, 408

Harvey Ras proto-oncogene - HRAS, 56, 58,
245

Heat shock protein 27 - HSPB2, 185 

Heat shock protein 90, A - HSPCA, 185

Heat shock protein 90, B - HSPCB, 110, 185

Heat shock protein, 70 kDa, A1A - HSPA1A,
18, 108, 185, 346

Heat shock protein, 70 kDa, A1B - HSPA1B,
185, 346

Heat shock protein, 70 kDa, A1L - HSPA1L,
185, 346

Heat shock protein, 70 kDa, A2 - HSPA2, 108,
185

Heat shock protein, 70 kDa, A3 - HSPA3, 185

Heat shock protein, 70 kDa, A4 - HSPA4, 185

Heat shock protein, 70 kDa, A5 - HSPA5, 185

Heat shock protein, 70 kDa, A6 - HSPA6, 185

Heat shock protein, 70 kDa, A7 - HSPA7, 185

Heat shock protein, 70 kDa, A8 - HSPA8, 185

Heat shock protein, 70 kDa, A9 - HSPA9, 185

Hemochromatosis — HFE, 171 

Hemopexin — HPX, 123 

Heparin cofactor II - HCF2, 192 

Hepatocyte nuclear factor 1α - TCF1, 122,
250

Hepatocyte nuclear factor 4 - HNF4A, 189 

Hermansky-Pudlak syndrome — HPS, 23 

HERV-H-associating 1 - HHLA1, 243 

Heterogeneous nuclear ribonuclear riboprotein
A1 - HNRPA1, 176

Hexokinase 1 — HK1, 146 

Hexokinase 2 — HK2, 146, 273 

Hexokinase 3 — HK3, 146 

Hexosaminidase B, beta polypeptide - HEXB, 
58 

High mobility group protein 1 - HMG1, 146,
250

High mobility group protein 14 - HMG14, 250

High mobility group protein 17 - HMG17, 250

High mobility group protein 2 - HMG2, 146,
250

High mobility group protein 4 - HMG4, 250

Histidine-rich glycoprotein - HRG, 184 

Histone 1 family 2 - H1F2, 14, 60, 180 

Histone 1 family 3 - H1F3, 14, 180 

Histone 1 family 4 - H1F4, 14, 60, 180

Histone 1, family 1 —- H1F1, 180, 228 

Histone 1, family 5 - H1F5, 180 

Histone 2 family A — H2A, 14, 180, 181, 231 

Histone 2 family B - H2B, 14, 180, 181, 231 

Histone 3 family 2 - H3F2, 14 

Histone 4 family 2 - H4F2, 14 

Histone deacetylase - HDAC1, 145

Histone H1° - H1F0, 180, 228

Histone H1, testis-specific - H1FT, 180

Histone H2A.X — H2AX, 180 

Histone H2A.Z — H2AZ, 180 

Histone H3.3A — H3F3A, 180 

Histone H3.3B — H3F3B, 180 

HLA-associated transcript 2 (BAT2) — 
D6S51E, 342 

HMG CoA synthase, cytoplasmic - HMGCS1, 
349 


HMG CoA synthase, mitochondrial — 
HMGCS2, 349 

Homeobox 7 — MSX1, 167 

Homeobox A1 - HOXA1, 166

Homeobox A10 - HOXA10, 166

Homeobox A11 - HOXA11, 166

Homeobox A13 - HOXA13, 166 

Homeobox A2 — HOXA2, 166 

Homeobox A3 — HOXA3, 166 

Homeobox A4 — HOXA4, 166 

Homeobox A5 - HOXA5, 166

Homeobox A6 — HOXA6, 166 

Homeobox A7 — HOXA7, 166 

Homeobox A9 — HOXA9, 166 

Homeobox B1 - HOXB1, 166

Homeobox B13 - HOXB13, 166 

Homeobox B2 - HOXB2, 166 

Homeobox B3 - HOXB3, 166 

Homeobox B4 - HOXB4, 166 

Homeobox B5 - HOXB5, 166

Homeobox B6 - HOXB6, 166 

Homeobox B7 — HOXB7, 166 

Homeobox B8 — HOXB8, 166 

Homeobox B9 — HOXB9, 166 

Homeobox C10 — HOXC10, 167 

Homeobox C11 — HOXC11, 167 

Homeobox C12 - HOXC12, 167 

Homeobox C13 — HOXC13, 167 

Homeobox C4 - HOXC4, 166 

Homeobox C5 - HOXC5, 166

Homeobox C6 — HOXC6, 167 

Homeobox C8 — HOXC8, 167, 249 

Homeobox C9 — HOXC9, 167 

Homeobox D1 — HOXD1, 167 

Homeobox D10 - HOXD10, 167 

Homeobox D11 - HOXD11, 167 

Homeobox D12 - HOXD12, 167 

Homeobox D13 - HOXD13, 167, 362 

Homeobox D3 - HOXD3, 167 

Homeobox D4 - HOXD4, 167 

Homeobox D8 — HOXD8, 167 

Homeobox D9 — HOXD9, 167 

Homeobox, H6 family 2 - HMX2, 167 

Hormone receptor TR3 — HMR, 60 

Huntingtin — HD, 362, 364, 365 

Hydroxysteroid dehydrogenase, 3β-, type I -
HSD3B2, 110

Hypoxanthine phosphoribosyltransferase -
HPRT1, 29, 331

I factor - IF, 58, 153

Iduronate-2-sulfatase - IDS, 176, 268, 346

Immunoglobulin λ-like polypeptide 1 - IGLL1,
277

Immunoglobulin D segments (orphon) — 
IGHDY2, 195 

Immunoglobulin E receptor, Fc fragment, IA -
FCER1A, 60, 140

Immunoglobulin E receptor, Fc fragment, IB -
FCER1B, 60, 110, 140

Immunoglobulin E receptor, Fc fragment, IG -
FCER1G, 60, 239, 240

Immunoglobulin E receptor, Fc fragment, II -
FCER2, 60

Immunoglobulin G receptor, Fc fragment, 2A -
FCGR2A, 60, 401

Immunoglobulin G receptor, Fc fragment, 2B -
FCGR2B, 60, 401

Immunoglobulin G receptor, Fc fragment, 2C -
FCGR2C, 60, 400, 401

Immunoglobulin G receptor, Fc fragment, 3A -
FCGR3A, 60

Immunoglobulin G receptor, Fc fragment, 3B -
FCGR3B, 60

Immunoglobulin G receptor, Fc fragment, IA -
FCGR1A, 65

Immunoglobulin G receptor, Fc fragment, IB -
FCGR1B, 65

Immunoglobulin G receptor, Fc fragment, IC -
FCGR1C, 65, 140

Immunoglobulin heavy chain, constant δ -
IGHD, 194, 245, 346

Immunoglobulin heavy chain, constant μ -
IGHM, 194

Immunoglobulin heavy chain, constant α1 -
IGHA1, 194, 408, 410

Immunoglobulin heavy chain, constant ε1 -
IGHE, 194, 267

Immunoglobulin heavy chain, constant γ1 -
IGHG1, 194, 410

Immunoglobulin heavy chain, constant α2 -
IGHA2, 194, 408

Immunoglobulin heavy chain, constant γ2 -
IGHG2, 194

Immunoglobulin heavy chain, constant γ3 -
IGHG3, 194

Immunoglobulin heavy chain, constant γ4 -
IGHG4, 194, 331, 352

Immunoglobulin heavy chain, D segments — 
IGHDY, 194 

Immunoglobulin heavy chain, J segments -
IGHJ, 194, 245

Immunoglobulin heavy chain, variable
segments - IGHV, 194, 195, 267, 272, 274,
278, 352, 408

Immunoglobulin light chain, kappa, constant -
IGKC, 195, 346

Immunoglobulin light chain, kappa, J segments
- IGKJ, 195

Immunoglobulin light chain, kappa, variable
segments - IGKV, 194, 267, 391, 396, 401

Immunoglobulin light chain, lambda, constant
- IGLC, 195, 271, 346, 410

Immunoglobulin light chain, lambda, J
segments - IGLJ, 195

Immunoglobulin light chain, lambda, variable
segments - IGLV, 195

Immunoglobulin VH segments (orphon) -
IGHV2, 195

Immunoglobulin VH segments (orphon) -
IGHV3, 195

Imprinted in Prader-Willi syndrome - IPW, 9

Indian hedgehog - IHH, 251

Indoleamine 2,3 dioxygenase - IDO, 109, 197

Inhibitor of DNA binding - ID2, 108

Insulin — INS, 16, 30, 56, 60, 186, 198, 245, 393 

Insulin promoter factor 1 — IPF1, 335 


Insulin receptor — INSR, 16, 60, 186, 198 

Insulin receptor-related — INSRR, 60 

Insulin-like 2 - INSL2, 60

Insulin-like growth factor 1 - IGF1, 56, 60, 143,
186, 198

Insulin-like growth factor 1 receptor - IGF1R,
60, 198

Insulin-like growth factor 2 - IGF2, 30, 56, 60,
130, 143, 186, 245, 393

Insulin-like growth factor 2 receptor - IGF2R,
30, 393

Integrin, α1 - ITGA1, 168, 169

Integrin, β1 - ITGB1, 169

Integrin, α2 - ITGA2, 168, 169

Integrin, β2 - ITGB2, 169

Integrin, α2b - ITGA2B, 15, 169, 247, 348

Integrin, α3 - ITGA3, 15, 168, 169, 247, 348

Integrin, β3 - ITGB3, 169

Integrin, α4 - ITGA4, 168, 169

Integrin, β4 - ITGB4, 169

Integrin, α5 - ITGA5, 168, 169

Integrin, β5 - ITGB5, 169

Integrin, α6 - ITGA6, 168, 169

Integrin, β6 - ITGB6, 169

Integrin, α7 - ITGA7, 168, 169

Integrin, β7 - ITGB7, 169

Integrin, α8 - ITGA8, 168, 169

Integrin, β8 - ITGB8, 169

Integrin, α9 - ITGA9, 168, 169

Integrin, αD - ITGAD, 168, 169

Integrin, αE - ITGAE, 168, 169

Integrin, αL - ITGAL, 168, 169

Integrin, αM - ITGAM, 168, 169

Integrin, αV - ITGAV, 169

Integrin, αX - ITGAX, 169

Inter-α-trypsin inhibitor heavy chain 1 -
ITIH1, 391

Inter-α-trypsin inhibitor heavy chain 2 -
ITIH2, 391

Inter-α-trypsin inhibitor heavy chain 3 -
ITIH3, 391

Inter-α-trypsin inhibitor heavy chain 4 -
ITIH4, 391

Interferon β - IFNB1, 16, 108, 186, 187, 225

Interferon γ - IFNG, 16, 186, 189

Interferon α1 - IFNA1, 186, 187, 408

Interferon ω1 - IFNW1, 16, 186, 187

Interferon α10 - IFNA10, 186, 187, 188, 333

Interferon α13 - IFNA13, 186, 187, 408

Interferon α14 - IFNA14, 186, 187

Interferon α16 - IFNA16, 186, 187

Interferon α17 - IFNA17, 186, 187

Interferon α2 - IFNA2, 186, 187, 188, 333

Interferon α21 - IFNA21, 186, 187

Interferon α4 - IFNA4, 186, 187

Interferon α5 - IFNA5, 186, 187

Interferon α6 - IFNA6, 186, 187

Interferon α7 - IFNA7, 186, 187

Interferon α8 - IFNA8, 186, 187

Interferon receptor α, β, ω, 1 - IFNAR1, 16,
199, 346, 348

Interferon receptor α, β, ω, 2 - IFNAR2, 199

Interferon receptor γ1 - IFNGR1, 16

Interferon receptor γ2 - IFNGR2, 199

Interleukin 11 receptor α - IL11RA, 26, 402

Interleukin 12B - IL12B, 16

Interleukin 13 - IL13, 16

Interleukin 2 - IL2, 58

Interleukin 3 - IL3, 16, 58

Interleukin 3 receptor α - IL3RA, 16, 80, 81

Interleukin 4 - IL4, 16, 58

Interleukin 4 receptor - IL4R, 198

Interleukin 5 - IL5, 16, 58

Interleukin 6 - IL6, 236

Interleukin 8 - IL8, 198

Interleukin 9 - IL9, 16, 58, 396

Interleukin 9 receptor - IL9R, 396

Interleukin-1 receptor antagonist - IL1RN,
199, 321

Interleukin-1α - IL1A, 199, 321

Interleukin-1β - IL1B, 199, 321

Interleukin-6 signal transducer - IL6ST, 272

Interleukin-8 receptor B - IL8RB, 198, 199, 267

Interleukin-8-receptor A - IL8RA, 198, 199,
267

Involucrin — IVL, 16, 365, 367, 373 

Islet amyloid polypeptide - IAPP, 59, 270

Isocitrate dehydrogenase — IDH3G, 233 

Isovaleryl coenzyme A dehydrogenase — IVD, 
60 

Janus tyrosine kinase 1 - JAK1, 56

Janus tyrosine kinase 2 - JAK2, 56

Janus tyrosine kinase 3 - JAK3, 56

Jun B proto-oncogene - JUNB, 59

Jun D proto-oncogene - JUND, 59

Jun proto-oncogene - JUN, 59, 107, 144

Jun-associated transcription factor — QM, 145 

Kallikrein, glandular — KLK2, 59 

Kallikrein, plasma — KLK3, 58, 191 

Kallikrein, renal/pancreatic/salivary - KLK1, 
59 

Kallistatin — PI4, 192 

Kallmann syndrome 1 - KAL1, 78, 267

Keratin 1 — KRT1, 169, 170 

Keratin 10 -KRT10, 169, 170 

Keratin 12 —- KRT12, 170 

Keratin 13 — KRT13, 170 

Keratin 14 — KRT14, 169, 170, 266 

Keratin 15 - KRT15, 169, 170 

Keratin 16 - KRT16, 169, 170 

Keratin 17 - KRT17, 169, 170 

Keratin 18 - KRT18, 169, 170, 238, 239 

Keratin 19 - KRT19, 169, 170 

Keratin 2A — KRT2A, 169, 170 

Keratin 3 — KRT3, 170 

Keratin 4 —- KRT4, 170 

Keratin 5 — KRT5, 169, 170 

Keratin 6A — KRT6A, 169, 170 

Keratin 6B — KRT6B, 169, 170 

Keratin 7 — KRT7, 170 

Keratin 8 — KRT8, 169, 170 

Keratin 9- KRT9, 169, 170 

Keratin, hair, acidic 1 - KRTHA1, 170 

Keratin, hair, acidic 2 - KRTHA2, 170 

Keratin, hair, acidic 3A - KRTHA3A, 170 

Keratin, hair, acidic 3B - KRTHA3B, 170 


Keratin, hair, acidic 4 - KRTHA4, 170

Keratin, hair, acidic 5 - KRTHA5, 170

Keratin, hair, basic 1 - KRTHB1, 170

Keratin, hair, basic 2 - KRTHB2, 170

Keratin, hair, basic 3 - KRTHB3, 170

Keratin, hair, basic 4 - KRTHB4, 170

Keratin, hair, basic 5 - KRTHB5, 170

Keratin, hair, basic 6 - KRTHB6, 170 

Ketohexokinase — KHK, 355 

Ki-Ras 1 proto-oncogene — KRAS2, 56, 59, 272 

Kininogen — KNG, 184, 352 

Kit proto-oncogene — KIT, 58 

Lactalbumin, α- - LALBA, 142, 349

Lactate dehydrogenase A - LDHA, 15, 59, 349 

Lactate dehydrogenase A-like 2— LDHAL2, 
59 

Lactate dehydrogenase B - LDHB, 15, 59, 349 

Lactate dehydrogenase C - LDHC, 59 

Lactoferrin — LTF, 229 

Lamin B - LMNB2, 108 

Lamin B receptor — LBR, 399 

Laminin receptor, 68 kDa - LAMR1, 9

Lecithin:cholesterol acyltransferase - LCAT, 
113, 343 

Lectin-like type II integral membrane protein -
KLRC1, 342

Leukemia inhibitory factor receptor — LIFR, 
113, 339 

Leukocyte tyrosine kinase — LTK, 60 

Lipase, hepatic — LIPC, 59, 122, 144 

Lipase, hormone-sensitive — LIPE, 59, 236 

Lipase, pancreatic - PNLIP, 122, 144

Lipoprotein lipase — LPL, 13, 122, 144, 236 

Loricrin — LOR, 16, 372, 373 

Low density lipoprotein receptor - LDLR, 16, 
29, 60, 123 

Low density lipoprotein-related protein — 
LRP1, 60 

Luteinizing hormone β - LHB, 60, 227, 336

Lymphoblastic leukemia-derived sequence 1 -
LYL1, 59

Lymphocyte-specific protein, tyrosine kinase -
LCK, 59, 62

Lymphoid enhancer-binding factor 1 - LEF1,
250

Lymphotactin α - LTNA, 274

Lymphotactin β - LTNB, 274

Lysosomal-associated membrane protein 2 — 
LAMP2, 66 

Lysozyme — LYZ, 301, 342, 349 

Macroglobulin, α2- - A2M, 59, 154, 267

Macrophage colony-stimulating factor - CSF2,
16, 58

Macrophage mannose receptor - MRC1, 353

MADS box enhancing factor 2A - MEF2A, 56, 
275 

MADS box enhancing factor 2B - MEF2B, 56 

MADS box enhancing factor 2C - MEF2C, 56 

MADS box enhancing factor 2D - MEF2D, 56 

Major histocompatibility complex class I- 
related A — MICA, 171 

Major histocompatibility complex class I-
related B - MICB, 171

Major histocompatibility complex class I-
related C - MICC, 171

Major histocompatibility complex class I-
related D - MICD, 171

Major histocompatibility complex class I-
related E - MICE, 171

Major histocompatibility complex, class Ia, A — 
HLA-A, 171, 173, 227, 267, 278 

Major histocompatibility complex, class Ia, B — 
HLA-B, 140, 171, 173, 227, 411 

Major histocompatibility complex, class Ia, C — 
HLA-C, 171, 173 

Major histocompatibility complex, class Ib, E — 
HLA-E, 171, 173, 174, 267 

Major histocompatibility complex, class Ib, F -
HLA-F, 171, 173, 267

Major histocompatibility complex, class Ib, G — 
HLA-G, 171, 173, 267 

Major histocompatibility complex, class II, 
DNA - HLA-DNA, 171 

Major histocompatibility complex, class II,
DQA1 - HLA-DQA1, 171

Major histocompatibility complex, class II,
DQA2 - HLA-DQA2, 171

Major histocompatibility complex, class II,
DQB1 - HLA-DQB1, 277, 408

Major histocompatibility complex, class II,
DRA - HLA-DRA, 171, 411

Major histocompatibility complex, class II,
DRB1 - HLA-DRB1, 86, 171, 344

Major histocompatibility complex, class II, 
DRB2 - HLA-DRB2, 171 

Major histocompatibility complex, class II, 
DRB3 - HLA-DRB3, 171 

Major histocompatibility complex, class II, 
DRB4 - HLA-DRB4, 171 

Major histocompatibility complex, class II,
DRB5 - HLA-DRB5, 171

Major histocompatibility complex, class II, 
DRB6 - HLA-DRB6, 121, 239, 244, 399 

Mammaglobin 1 - MGB1, 228

Mammaglobin 2 - MGB2, 228

Mannose phosphate isomerase - MPI, 59

Mannose-binding protein - MBP, 267, 404

Mannosidase, alpha A, cytoplasmic - MANA1,
61

Mannosidase, alpha B, lysosomal - MANB, 61

Maspin - PI5, 192

Mast cell chymase - CMA1, 191

Matrix metalloproteinase 1 - MMP1, 236, 348

Matrix metalloproteinase 10 -MMP10, 348 

Matrix metalloproteinase 11 -MMP11, 348 

Matrix metalloproteinase 12 -MMP12, 348 

Matrix metalloproteinase 13 - MMP13, 348 

Matrix metalloproteinase 14 -MMP14, 348 

Matrix metalloproteinase 15 - MMP15, 348

Matrix metalloproteinase 16 - MMP16, 348 

Matrix metalloproteinase 19 - MMP19, 348 

Matrix metalloproteinase 2 —- MMP2, 348 

Matrix metalloproteinase 3 - MMP3, 348 

Matrix metalloproteinase 7 - MMP7, 348 

Matrix metalloproteinase 8 - MMP8, 348 

Matrix metalloproteinase 9- MMP9, 348 


Max protein — MAX, 199 

Melanin-concentrating hormone - PMCH, 351 

Melanin-concentrating hormone-like 1 -
PMCHL1, 351

Melanin-concentrating hormone-like 2 — 
PMCHL2, 351 

Membrane cofactor protein - MCP, 153

Membrane component, surface marker 2 -
M17S2, 238

Metallothionein 2A - MT2A, 18

Metaxin — MTX, 267 

Methylguanine, O6-, DNA methyltransferase -
MGMT, 110

Methylthioadenosine phosphorylase - MTAP,
273

MIC2 antigen - MIC2, 78, 80

Microfibril-associated glycoprotein 2 — 
MAGP2, 109 

Microfibril-associated protein 2 - MFAP2, 
109 

Microglobulin, β2- - B2M, 124, 172

Microseminoprotein, β- - MSMB, 300

Microsomal glutathione S-transferase 1 -
MGST1, 60

Microtubule-associated protein 1A - MAP1A,
25

Microtubule-associated protein 1B - MAP1B,
25


Mineralocorticoid receptor — MLR, 58, 189 

Minibrain, homolog of Drosophila gene — 
MNBH, 144 

Minichromosome maintenance, yeast homolog 
2 —- MCM2, 145 

Minichromosome maintenance, yeast homolog 
3 - MCM3, 145 

Minichromosome maintenance, yeast homolog 
4 —- MCM4, 145, 233 

Minichromosome maintenance, yeast homolog
5 - MCM5, 145

Minichromosome maintenance, yeast homolog
6 - MCM6, 145

Minichromosome maintenance, yeast homolog
7 - MCM7, 145

Mismatch repair protein, E. coli mutL homolog
1 - MLH1, 147

Mismatch repair protein, E. coli mutS homolog 
2 —MSH2, 147 

Mitochondrial NADH: ubiquinone 
oxidoreductase - NDUFV2, 272, 275 

Mitochondrial-3-glycerol phosphate 
dehydrogenase 2 - GPD2, 272 

Mitotic regulator, CHC1 — CHC1, 9, 112, 353, 
354 

MOK2 transcription factor - MOK2, 193, 253

Molybdenum cofactor synthesis protein — 
MOCOD, 25 

Monoamine oxidase A - MAOA, 78, 109, 236 

Monoamine oxidase B — MAOB, 109 

MTG8/ETO proto-oncogene - CBFA2T1, 112 

Mucin 1, transmembrane — MUCI, 61, 175, 
357, 372 
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Mucin 2, intestinal/tracheal - MUC2, 9, 17, 61, 
175, 176, 357 

Mucin 3, intestinal - MUC3, 9, 175 

Mucin 4, tracheobronchial — MUC4, 9, 175, 357 

Mucin 5, tracheobronchial - MUC5B, 9, 17, 61,
116, 175, 176 

Mucin 5AC, tracheobronchial/gastric — 
MUC5AC, 9, 17, 61, 175, 176

Mucin 6, gastric - MUC6, 9, 17, 175, 176 

Mucin 7, salivary - MUC7, 175 

Mucin 8, tracheobronchial — MUC8, 175 

Myb proto-oncogene — MYB, 146 

Myc proto-oncogene — MYC, 143, 198 

Myc proto-oncogene homolog 1 - MYCL1, 59 

Myelin proteolipid protein - PLP, 300

Myelin-associated glycoprotein —- MAG, 59 

Myeloid/lymphoid leukemia, Drosophila 
trithorax homolog — MLL, 403 

Myeloperoxidase — MPO, 122, 239, 240 

Myogenic factor 3 - MYOD1, 28, 59, 251, 350

Myogenic factor 5 - MYF5, 59, 251, 350

Myogenic factor 6 - MYF6, 59, 251, 350 

Myogenin - MYOG, 59 

Myoglobin — MB, 160 

Myosin light chain, regulatory - MYL5, 228

Myosin, α-, heavy chain - MYH6, 67, 404

Myosin, β-, heavy chain - MYH7, 29, 67, 408

NADH: ubiquinone oxidoreductase B13 
subunit - NDUFA5, 272

NADP-dependent malate dehydrogenase — 
ME1, 390

Nerve growth factor — NGFB, 60, 198 

Neural cell adhesion molecule - NCAM1, 59,
112

Neuroblastoma Ras proto-oncogene — NRAS, 
59, 62 

Neurofibromin - NF1, 9, 29, 108, 110, 145, 247,
269, 272, 274, 396 


Neurofilament protein, 68kDa — NEFL, 121 

Neuronal apoptosis inhibitory protein - NAIP,
346

Neuroserpin — PI12, 192 

Neurotrophic tyrosine kinase receptor type 1 — 
NTRK1, 60, 198

Neurotrophic tyrosine kinase receptor type 2 -
NTRK2, 198

Neurotrophic tyrosine kinase receptor type 3 -
NTRK3, 144, 198

Neurotrophin 3 - NTF3, 198

Neurotrophin 5 - NTF5, 198

Neutrophil cytosolic factor p47-phox — NCF1, 
277 

Nexin — PI7, 192 

Nitric oxide synthase, inducible - NOS2A, 246 

Non-specific cross-reacting antigen - NCA, 59 

Notch 1, homolog of Drosophila gene — 
NOTCH1, 346

Notch 4, homolog of Drosophila gene -
NOTCH4, 346, 348

Nuclear factor I/C — NFIC, 113 

Nuclear factor I/X — NFIX, 113 

Nucleolin — NCL, 176 

Nucleophosmin — NPM1, 273 


Nucleotide excision repair protein RAD52 -
RAD52, 272

Oculopharyngeal muscular dystrophy — 
PABP2, 362 

Olfactory receptor 10A1 - OR10A1, 143, 193

Olfactory receptor 1A1 - OR1A1, 143, 193

Olfactory receptor 1D2 - OR1D2, 143, 193

Olfactory receptor 1D4 - OR1D4, 143, 193

Olfactory receptor 1D5 - OR1D5, 143, 193

Olfactory receptor 1E1 - OR1E1, 143, 193

Olfactory receptor 1E2 - OR1E2, 143, 193

Olfactory receptor 1F1 - OR1F1, 143, 193

Olfactory receptor 1G1 - OR1G1, 143, 193

Olfactory receptor 2C1 - OR2C1, 174

Olfactory receptor 2D2 - OR2D2, 143, 193

Olfactory receptor 3A1 - OR3A1, 143, 193

Olfactory receptor 3A2 - OR3A2, 143, 193

Olfactory receptor 3A3 - OR3A3, 143, 193

Olfactory receptor 5D3 - OR5D3, 143, 193

Olfactory receptor 5D4 - OR5D4, 143, 193

Olfactory receptor 5F1 - OR5F1, 143, 193

Olfactory receptor 6A1 - OR6A1, 143, 193

Oligodendrocyte myelin glycoprotein - OMG, 
9, 109, 110 

Oncomodulin — OCM, 244 

Origin recognition complex, homolog of yeast 
gene - ORC1L, 145

Ornithine aminotransferase — OAT, 343 

Ornithine aminotransferase-like 1 - OATL1, 78

Ornithine carbamoyltransferase —- OTC, 15 

Osteocalcin - BGLAP, 20

Osteonectin — SPARC, 58, 230 

Otoconin 90 — OC90, 243 

Oxidized low density lipoprotein receptor 1 — 
OLR1, 354

Oxytocin receptor - OXTR, 110 

P-glycoprotein 1 - PGY1, 358, 398, 408 

P-glycoprotein 3 — PGY3, 358, 408 

Paired basic amino acid cleaving enzyme 
(furin) - PACE, 197 

Paired basic amino acid cleaving enzyme 4 — 
PACE4, 197 

Paired box homeotic gene 1 - PAX1, 144

Paired box homeotic gene 2 — PAX2, 144 

Paired box homeotic gene 3 — PAX3, 144 

Paired box homeotic gene 4 — PAX4, 144 

Paired box homeotic gene 5 - PAX5, 144

Paired box homeotic gene 6 — PAX6, 28, 144, 
253 

Paired box homeotic gene 7 — PAX7, 144 

Paired box homeotic gene 8 — PAX8, 144, 253 

Paired box homeotic gene 9 — PAX9, 144 

Pancreatic polypeptide/pancreatic icosapeptide 
—PPY, 25 

Parathyroid hormone — PTH, 59, 239, 241 

Parathyroid hormone-like hormone - PTHLH, 
59 

Parvalbumin — PVALB, 108 

Pepsinogen A3 — PGA3, 61, 120, 331, 404 

Pepsinogen A4 — PGA4, 61, 120, 331, 404 

Pepsinogen A5 - PGA5, 61, 120, 331, 404

Peptidase B — PEPB, 60 

Peptidase C - PEPC, 60 
Peptidase D - PEPD, 60

Perlecan — HSPG2, 124 

Peroxisomal L-alanine: glyoxylate 
aminotransferase —-AGXT, 358 

Peroxisome proliferator-activated receptor α -
PPARA, 189

Peroxisome proliferator-activated receptor δ -
PPARD, 189

Peroxisome proliferator-activated receptor γ -
PPARG, 21, 189 

Phenylalanine hydroxylase - PAH, 57, 60, 109,
392, 393, 394 

Phosphatase and tensin homolog (mutated in 
multiple advanced cancers) — PTEN, 145 

Phosphatidylinositol glycan class F - PIGF,
273

Phosphofructokinase, muscle — PFKM, 61, 
238 

Phosphofructokinase, polypeptide X - PFKX, 
61 

Phosphoglucomutase 3 - PGM3, 390 

Phosphogluconate dehydrogenase — PGD, 15 

Phosphoglycerate kinase 1 - PGK1, 406, 411, 
412 

Phosphoglycerate kinase 2 - PGK2, 411, 412 

Phospholipase A2, group IB - PLA2G1B, 60

Phospholipase A2, group IIA - PLA2G2A, 60

Phospholipase A2, group IIC - PLA2G2C, 60

Phospholipid transfer protein - PLTP, 346

Phosphomannomutase — PMM2, 277 

Phosphoribosylaminoimidazole carboxylase — 
PAICS, 233 

Phosphoribosylpyrophosphate 
amidotransferase — PPAT, 233 

Pigment epithelium-derived factor - PEDF,
191

Pim-1 proto-oncogene - PIM1, 348

Placentally expressed gene — Plt, 244 

Plasminogen — PLG, 191, 197, 318, 357 

Plasminogen activator inhibitor 1 - PAI1, 114,
236 

Plasminogen activator inhibitor 2 — PAI2, 192 

Plasminogen activator, tissue-type — PLAT, 126, 
191, 229 

Platelet-derived growth factor receptor-α -
PDGFRA, 58

Platelet-derived growth factor receptor-β -
PDGFRB, 16, 58 

Pleiotrophin — PTN, 239, 243, 244 

Poly(A) binding protein - PABPL1, 176

Polycystin - PKD1, 277, 350 

Polymeric immunoglobulin receptor - PIGR, 
60 

Post-meiotic segregation increased gene 1 
(homolog of yeast gene) - PMS1, 147 

Post-meiotic segregation increased gene 2 
(homolog of yeast gene) - PMS2, 9, 147 

Potassium voltage-gated channel, shaker- 
related subfamily, member 1 - KCNA1, 60 

Potassium voltage-gated channel, shaker- 
related subfamily, member 2 - KCNA2, 60 

Potassium voltage-gated channel, shaker- 
related subfamily, member 5 - KCNA5, 60


Potassium voltage-gated channel, shaker- 
related subfamily, member 7 - KCNA7, 60 

Potassium voltage-gated channel, Shaw-related 
subfamily, member 2 - KCNC2, 60 

Potassium voltage-gated channel, Shaw-related 
subfamily, member 3 - KCNC3, 60 

Potassium voltage-gated channel, Shaw-related 
subfamily, member 4 —- KCNC4, 60 

POU domain transcription factor 2F1 — 
POU2F1, 9, 61

POU domain transcription factor 2F2 — 
POU2F2, 61, 254 

POU domain transcription factor 3F2 — 
POU3F2, 101 

Pre-B cell leukemia transcription factor 1 — 
PBX1, 57 

Pre-B cell leukemia transcription factor 2 — 
PBX2, 57, 346, 348 

Pre-B cell leukemia transcription factor 3 — 
PBX3, 57, 346 

Pregnancy zone protein - PZP, 59

Pregnancy-specific glycoprotein 1 — PSG1, 59, 
226, 247, 300, 347, 348 

Pregnancy-specific glycoprotein 11 — PSG11, 
59, 226, 247, 347, 348 

Pregnancy-specific glycoprotein 12 - PSG12, 
59, 226, 247, 347, 348 

Pregnancy-specific glycoprotein 13 —- PSG13, 
59, 226, 247, 347, 348 

Pregnancy-specific glycoprotein 2 — PSG2, 59, 
226, 247, 347, 348 

Pregnancy-specific glycoprotein 3 — PSG3, 59, 
226, 247, 347, 348 

Pregnancy-specific glycoprotein 4 — PSG4, 59, 
226, 247, 347, 348 

Pregnancy-specific glycoprotein 5 - PSG5, 59,
226, 247, 347, 348 

Pregnancy-specific glycoprotein 6 — PSG6, 59, 
226, 247, 347, 348 

Pregnancy-specific glycoprotein 7 — PSG7, 59, 
226, 247, 347, 348 

Pregnancy-specific glycoprotein 8 — PSG8, 59, 
226, 247, 347, 348 

Progesterone receptor — PGR, 60, 189 

Prohibitin — PHB, 17 

Prolactin — PRL, 230 

Prolactin receptor —PRLR, 346 

Proliferating cell nuclear antigen — PCNA, 111 

Proline-rich protein, BstNI subfamily 1 — 
PRB1, 351, 354, 404, 408

Proline-rich protein, BstNI subfamily 2 — 
PRB2, 351, 354, 404, 408 

Proline-rich protein, BstNI subfamily 3 — 
PRB3, 351, 354, 404, 408 

Proline-rich protein, BstNI subfamily 4 — 
PRB4, 351, 354, 404, 408 

Proline-rich protein, HaeIII subfamily 1 — 
PRH1, 354, 357

Proline-rich protein, HaeIII subfamily 2 — 
PRH2, 354 

Prolyl-4 hydroxylase β polypeptide - P4HB,
399 


Properdin (B factor) - BF, 124, 171
Proprotein convertase 1 - PCSK1, 197

Proprotein convertase 2 — PCSK2, 197 

Proprotein convertase 4 — PCSK4, 197 

Proprotein convertase 5 - PCSK5, 197

Prosaposin - PSAP, 109, 404

Prostaglandin EP4 receptor - PTGER4, 268 

Protamine 1 —PRM1, 247, 300, 342, 348 

Protamine 2 — PRM2, 247, 300, 342, 348 

Proteasome subunit α2 - PSMA2, 348

Proteasome subunit, beta type, 9 - LMP2, 
231 

Protein S - PROS1, 191, 267

Protein C - PROC, 126, 191, 427

Protein C inhibitor - PCI, 192

Protein geranylgeranyltransferase type I, β -
PGGT1B, 275

Protein kinase, cAMP-dependent, catalytic
alpha - PRKACA, 412

Protein kinase, class I, α - PRKCA, 190

Protein kinase, class I, β - PRKCB1, 190

Protein kinase, class I, γ - PRKCG, 190

Protein kinase, class II, δ - PRKCD, 190

Protein kinase, class II, θ - PRKCQ, 190

Protein kinase, class III, ε - PRKCE, 190

Protein kinase, class IV, ι - PRKCI, 190

Protein kinase, class IV, ζ - PRKCZ, 190

Protein serine/threonine kinase stk2 — STK2, 
342 

Protein tyrosine kinase, receptor type, f
polypeptide - PTPRF, 59

Prothrombin — F2, 59, 88, 127, 191 

Prothymosin — PTMA, 271 

Protocadherin — PCDH7, 183 

Pulmonary surfactant protein A - SFTPA1,
142, 348 

Pulmonary surfactant protein D —- SFTPD, 348, 
350 

Purine nucleoside phosphorylase - NP, 110

Purinoceptor 1 — P2RY1, 108, 348 

Purinoceptor 2 — P2RY2, 348 

Purinoceptor 4 — P2RY4, 348 

Purinoceptor 6 — P2RY6, 348 

Purinoceptor 7 — P2RY7, 348 

Pyruvate dehydrogenase α1 subunit - PDHA1,
66, 108, 411, 412

Pyruvate dehydrogenase α2 subunit - PDHA2,
411, 412 

Pyruvate kinase, liver/red blood cell - PKLR, 
60 

Pyruvate kinase, muscle — PKM2, 60 

Quiescin Q6 - QSCN6, 399

Quinoid dihydropteridine reductase - QDPR,
58

Rab geranylgeranyl transferase α subunit -
RABGGTA, 10

Raf proto-oncogene B1 homolog - BRAF, 272

Raf proto-oncogene homolog 1 - ARAF1, 78

Ras proto-oncogene family member, Rab3A — 
RAB3A, 59 

Ras proto-oncogene family member, Rab3B — 
RAB3B, 59, 62 

Ras proto-oncogene family member, Rab4 — 
RAB4, 59, 62 


Ras proto-oncogene family member, Rap1A -
RAP1A, 59

Ras proto-oncogene family member, Rap1B -
RAP1B, 59, 62

Ras proto-oncogene homolog — RRAS, 59 

Recombination protein RAD23A, homolog of 
yeast gene - RAD23A, 145 

Recombination protein RAD23B, homolog of 
yeast gene — RAD23B, 145 

Recombination protein, homolog of yeast 
Rad51 gene — RECA, 145, 148 

Recombination-activating protein 1 - RAG1,
107, 406, 407

Recombination-activating protein 2 - RAG2,
107, 406, 407 

Red cone pigment - RCP, 14, 20, 110, 318, 330,
351, 401, 402, 408 

Regulator of mitotic spindle assembly 1 — 
RMSA1, 342 

Regulator of nonsense transcripts 1 - RENT1,
115 


Rel proto-oncogene — REL, 144, 343 

Relaxin 1 - RLN1, 186

Relaxin 2 - RLN2, 186

Renin — REN, 61, 120 

Ret proto-oncogene — RET, 145 

Retinoblastoma - RB1, 29

Retinoic acid receptor α - RARA, 189, 190

Retinoic acid receptor β - RARB, 189, 190

Retinoic acid receptor γ - RARG, 60, 189, 190

Retinoid X receptor α - RXRA, 57, 189, 346

Retinoid X receptor β - RXRB, 57, 189, 346,
348

Retinoid X receptor γ - RXRG, 57, 189

Rhesus blood group, CcEe antigens - RHCE, 
141, 408 

Rhesus blood group, D antigen - RHD, 141, 
300, 331, 408 

Rhesus blood group-associated antigen — 
RHAG, 300 

Rho cross-reacting protein 27A, E. coli 
homolog — HES1, 147 

Rhodopsin — RHO, 316 

Rhombotin 1 - LMO1, 59

Rhombotin-like 1 - LMO2, 59 

Rhombotin-like 2 - LMO3, 59 

Ribosomal protein L10 —- RPL10, 248 

Ribosomal protein L11 — RPL11, 248 

Ribosomal protein L13 —- RPL13, 248 

Ribosomal protein L15 —- RPL15, 248 

Ribosomal protein L17 — RPL17, 248 

Ribosomal protein L19 - RPL19, 248

Ribosomal protein L22 —- RPL22, 248 

Ribosomal protein L23A — RPL23A, 248, 271 

Ribosomal protein L24 — RPL24, 248 

Ribosomal protein L27 —- RPL27, 248 

Ribosomal protein L27A — RPL27A, 248 

Ribosomal protein L28 —- RPL28, 248 

Ribosomal protein L29 —- RPL29, 248 

Ribosomal protein L3 — RPL3, 248 

Ribosomal protein L30 — RPL30, 248 

Ribosomal protein L31 —- RPL31, 248 

Ribosomal protein L32 — RPL32, 248 
Ribosomal protein L35A - RPL35A, 248
Ribosomal protein L36A - RPL36A, 248
Ribosomal protein L37 - RPL37, 248
Ribosomal protein L38 - RPL38, 248
Ribosomal protein L4 - RPL4, 248
Ribosomal protein L41 - RPL41, 248
Ribosomal protein L5 - RPL5, 248
Ribosomal protein L6 - RPL6, 248
Ribosomal protein L7 - RPL7, 248
Ribosomal protein L8 - RPL8, 248
Ribosomal protein L9 - RPL9, 248
Ribosomal protein S10 - RPS10, 248
Ribosomal protein S11 - RPS11, 248
Ribosomal protein S12 - RPS12, 248
Ribosomal protein S13 - RPS13, 248
Ribosomal protein S14 - RPS14, 248
Ribosomal protein S15A - RPS15A, 248
Ribosomal protein S17 - RPS17, 248
Ribosomal protein S18 - RPS18, 248
Ribosomal protein S2 - RPS2, 248
Ribosomal protein S24 - RPS24, 248
Ribosomal protein S25 - RPS25, 248
Ribosomal protein S3 - RPS3, 248
Ribosomal protein S3A - RPS3A, 248
Ribosomal protein S4 - RPS4X/Y, 248
Ribosomal protein S4, X-linked - RPS4X, 78
Ribosomal protein S4, Y-linked - RPS4Y, 78,
79
Ribosomal protein S5 - RPS5, 248
Ribosomal protein S6 - RPS6, 248
Ribosomal protein S7 - RPS7, 248
Ribosomal protein S8 - RPS8, 248
Ribosomal protein S9 - RPS9, 248
Ribosomal RNA cluster 1 - RNR1, 181, 360,
410
Ribosomal RNA cluster 2 - RNR2, 181, 360,
410
Ribosomal RNA cluster 3 - RNR3, 181, 360,
410
Ribosomal RNA cluster 4 - RNR4, 181, 360,
410
Ribosomal RNA cluster 5 - RNR5, 181, 360,
410
Ribosomal RNA, 5S - RN5S1, 182
RNA, 7SK, nuclear - RN7SK, 238
RNA, 7SL, cytoplasmic - RN7SL, 32, 340
RNA-binding motif protein 1 - RBM1, 79, 80
Ryanodine receptor 1 - RYR1, 59
Ryanodine receptor 2 — RYR2, 59 
Secretory pathway protein, homolog of yeast 
gene - SEC13L1, 145
Selected cDNA on X, mouse, homolog of — 
SMCX, 303 
Selected cDNA on Y, mouse, homolog of — 
SMCY, 79, 303 
Selectin, E- - SELE, 225 
Selectin, P- - SELP, 229
Semenogelin 1 - SEMG1, 110, 330, 354
Semenogelin 2 - SEMG2, 110, 330, 354, 403 
Serine hydroxymethyltransferase, cytosolic — 
SHMT1, 273, 275 
Serotonin receptor 1F - HTR1F, 108
Serotonin receptor 1A - HTR1A, 58, 346

Serotonin receptor 1D - HTR1D, 108, 272

Serpin, ovalbumin-like — PI10, 192 

Serpin, ovalbumin-like - PI2, 192

Serpin, ovalbumin-like - PI6, 192

Serpin, ovalbumin-like — PI9, 192 

Serum amyloid A — SAA, 267 

Serum amyloid A1 - SAA1, 391

Serum amyloid A2 - SAA2, 391 

Sex determining region Y — SRY, 78, 79, 80, 
107, 121, 223, 224, 300 

SHC transforming protein - SHC1, 272 

Short stature homeobox — SHOX, 81 

Sialyltransferase 4C — SIAT4C, 61 

Signal sequence receptor δ - SSR4, 233

SIR2 (Silent mating type information 
regulation 2), homolog of yeast gene — 
SIR2L, 147 

Slowpoke, homolog of Drosophila gene — SLO, 
144 

Small inducible cytokine A3 - SCYA3, 400 

Small inducible cytokine A3-like 1 — 
SCYA3L1, 400 

Small inducible cytokine A3-like 2 — 
SCYA3L2, 400 

Small inducible cytokine family A, member 18 
— SCYA18, 400 

Small nuclear ribonucleoprotein polypeptide N 
— SNRPN, 30, 61, 274 

Small nuclear ribonucleoprotein, 70KD 
polypeptide - SNRP70, 61, 176 

Small nuclear ribonucleoprotein, polypeptide A 
— SNRPA, 61 

Small nuclear ribonucleoprotein, polypeptide F 
—SNRPE, 61 

Small nucleolar mRNA 1 - RNE1, 9

Small nucleolar mRNA 2 - RNE2, 9

Small proline-rich protein 1A - SPRR1A, 16,
226, 355, 372, 373

Small proline-rich protein 1B - SPRR1B, 16,
226, 355, 372, 373 

Small proline-rich protein 2A - SPRR2A, 16, 
226, 227, 355, 356, 372, 373 

Small proline-rich protein 2C - SPRR2C, 226 

Small proline-rich protein 3 - SPRR3, 16, 226, 
227, 355, 372, 373 

Sodium channel 1, α-subunit - SCN1A, 347

Sodium channel 10, α-subunit - SCN10A, 347

Sodium channel 2, α-subunit - SCN2A, 347

Sodium channel 3, α-subunit - SCN3A, 347

Sodium channel 4, α-subunit - SCN4A, 23, 347

Sodium channel 5, α-subunit - SCN5A, 23, 347

Sodium channel 6, α-subunit - SCN6A, 347

Sodium channel 7, α-subunit - SCN7A, 347

Sodium channel 8, α-subunit - SCN8A, 23, 347

Sodium channel 9, α-subunit - SCN9A, 347

Solute carrier family 2, member 1 — SLC2A1, 
59 

Solute carrier family 2, member 3 — SLC2A3, 
59 

Solute carrier family 2, member 5 — SLC2A5, 
59 

Solute carrier family 6, member 10 —- SLC6A10, 
346, 395 
Solute carrier family 6, member 4 - SLC6A4,
237 

Solute carrier family 6, member 8 — SLC6A8, 
346, 394, 395 

Solute carrier family 9, isoform A3 — SLC9A3, 
272 

Son of sevenless 1, homolog of Drosophila gene 
—SOS1, 144 

Son of sevenless 2, homolog of Drosophila gene 
— SOS2, 144 

Sonic hedgehog — SHH, 251 

Sorbitol dehydrogenase - SORD, 59 

Spectrin, α- - SPTA1, 356, 397

Spectrin, β- - SPTB, 356

Spinal muscular atrophy — SMA, 346 

Spinocerebellar ataxia type 1 (ataxin 1) - SCA1, 
362, 364 

Spinocerebellar ataxia type 2 (ataxin 2) - SCA2, 
362 

Spinocerebellar ataxia type 7 — SCA7, 362 

Squamous cell carcinoma antigen 1 - SCCA1, 
192 

Squamous cell carcinoma antigen 2 - SCCA2, 
192 

SR splicing factor SRp46 — Srp46, 412 

SRY-related HMG box 1 — SOX1, 250 

SRY-related HMG box 10 — SOX10, 250 

SRY-related HMG box 11 — SOX11, 250 

SRY-related HMG box 2 — SOX2, 250 

SRY-related HMG box 20 — SOX20, 250 

SRY-related HMG box 22 — SOX22, 250 

SRY-related HMG box 3 — SOX3, 79, 250 

SRY-related HMG box 4 — SOX4, 250 

SRY-related HMG box 5 - SOX5, 250

SRY-related HMG box 9 — SOX9, 250 

Statherin — STATH, 58 

Steroid sulfatase — STS, 80, 390, 394 

Stromal cell-derived factor 1 - SDF1, 87

Succinate dehydrogenase, iron-sulfur protein 
subunit — SDH1, 144 

Superoxide dismutase 1 - SOD1, 15 

Superoxide dismutase 2 - SOD2, 15 

Superoxide dismutase 3 - SOD3, 15 

Surfeit 1 - SURF1, 66

Surfeit 2 - SURF2, 66, 234

Surfeit 3 - SURF3, 66

Surfeit 4 - SURF4, 66, 234

Surfeit 5 - SURF5, 66, 122

Survival motor neuron protein — SMN, 346 

Synapsin I - SYN1, 78

Synaptobrevin-like 1 - SYBL1, 81

Synaptotagmin 1 - SYT1, 319

Synaptotagmin 2 - SYT2, 319

Synaptotagmin 3 - SYT3, 319

Synaptotagmin 4 - SYT4, 319

Synaptotagmin 5 - SYT5, 319

Syndecan 1 — SDC1, 56 

Syndecan 2 — SDC2, 56 

Syndecan 3 — SDC3, 56 

Syndecan 4 — SDC4, 56 

t complex responder — TCP10, 113 

T-cell antigen receptor, α-subunit - TCRA, 16,
67, 110, 196, 335, 346

T-cell antigen receptor, β-subunit - TCRB, 5,
16, 65, 67, 196, 331, 332, 333, 346, 391

T-cell antigen receptor, δ-subunit - TCRD, 16,
110, 196, 346, 410

T-cell antigen receptor, ε-subunit - TCRE, 16

T-cell antigen receptor, γ-subunit - TCRG, 16,
196, 283, 331, 346

TATA box-binding protein - TBP, 348, 360

Tenascin C —- HXB, 57 

Tenascin R — TNR, 57 

Tenascin XA — TNXA, 9, 57, 348 

Testis transcript Y1 - TTY1, 79

Testis transcript Y2 — TTY2, 79 

Testis-specific protein, Y-linked — TSPY, 79 

Thioredoxin — TXN, 399 

Thrombin receptor — F2R, 184 

Thrombomodulin — THBD, 8, 107 

Thrombospondin 1 — THBS1, 60, 347 

Thrombospondin 2 — THBS2, 347 

Thy-1 glycoprotein - THY1, 60, 124

Thymidine kinase 1 - TK1, 15, 230

Thymidine kinase 2 — TK2, 15 

Thymidylate synthase - TYMS, 110 

Thymosin β4 - TB4Y, 79

Thyroglobulin - TG, 398

Thyroid hormone receptor β - THRB, 189, 190

Thyroid hormone receptor α (ear-7) - THRA,
9, 189, 190

Thyroid hormone receptor α-like (ear-1) -
THRAL, 9, 189 

Thyroid peroxidase — TPO, 122 

Thyroid stimulating hormone beta - TSHB, 60 

Thyrotropin receptor — TSHR, 184 

Tissue inhibitor of metalloproteinase 1 — 
TIMP1, 66, 78, 110

Titin — TTN, 9 

Topoisomerase I —- TOP1, 270 

Transaldolase - TALDO1, 34, 337

Transcription factor 3 — TCF3, 59 

Transcription factor E2F1 — E2F1, 23 

Transcriptional elongation factor TFIIS — 
TCEA1, 273 

Transferrin - TF, 16, 21, 352, 353

Transferrin receptor - TFRC, 16

Transforming growth factor β1 - TGFB1, 21,
61, 347

Transforming growth factor β2 - TGFB2, 61,
347

Transforming growth factor β3 - TGFB3, 347

Transglutaminase 1 - TGM1, 10 

Transient receptor potential channel-related 
protein 1 - TRPC1, 144

Transition protein 2 — TNP2, 342 

Transketolase-related gene — TKT, 130 

Translation elongation factor G - EEF1G, 148

Translation elongation factor Tu - TUFM, 148, 
272 

Translation initiation factor 1A, Y-linked — 
EIF1AY, 79

Transmembrane 7 superfamily, member 2 — 
TM7SF2, 399 

Transporter, ATP-binding cassette, 1 — TAP1, 
231 
Transporter, ATP-binding cassette, 2 - TAP2, 5

Transthyretin — TTR, 29, 142 

Trichohyalin - THH, 16 

Triose phosphate isomerase - TPI1, 15, 59, 118

tRNA (alanine) - TRAN, 177 

tRNA (arginine 1) - TRR1, 177 

tRNA (arginine 3) — TRR3, 177 

tRNA (arginine 4) - TRR4, 177 

tRNA (asparagine) — TRN, 62, 177, 393 

tRNA (asparagine-like) - TRNL, 62, 393 

tRNA (glutamic acid) —- TRE, 62, 177 

tRNA (glutamic acid-like 1) - TREL1, 62

tRNA (glutamine 1) - TRQ1, 177

tRNA (glycine) - TRG1, 177

tRNA (leucine 1) - TRL, 177 

tRNA (leucine 2) - TRL2, 177 

tRNA (lysine) - TRK1, 177 

tRNA (methionine 1) - TRMI1, 177

tRNA (methionine 2) - TRMI2, 177

tRNA (opal suppressor phosphoserine) - TRSP,
178

tRNA (proline 1) - TRP1, 177 

tRNA (proline 2) - TRP2, 177 

tRNA (proline 3) - TRP3, 177 

tRNA (threonine 1) - TRT1, 177

tRNA (threonine 2) - TRT2, 177 

tRNA (valine 2) - TRV2, 60 

tRNA (valine 3) - TRV3, 60 

Tropomyosin, α- - TPM1, 112

Tropomyosin, β- - TPM2, 321

Troponin I, 1 - TNNI1, 61, 114, 116, 143

Troponin I, 2 - TNNI2, 114, 143

Troponin I, 3 - TNNI3, 114, 143

Troponin T1, skeletal slow - TNNT1, 61

Trypsin 1 - PRSS1, 191, 196, 197 

Trypsin 2 - PRSS2, 191, 196, 197 

Tryptophan hydroxylase — TPH, 57, 60, 392, 
393, 394 

Tubulin, β- - TUBB, 118, 245, 271

Tubulin, α1- - TUBA1, 118

Tubulin, α2- - TUBA2, 118

Tumor necrosis factor — TNF, 140, 171, 236 

Tumor necrosis factor receptor 1 — TNFR1, 59 

Tumor necrosis factor receptor 2 — TNFR2, 59 

Tumor protein p53 — TP53, 29 

Tumor rejection antigen 1 - TRA1, 60

Tumor suppressor gene lethal 2 giant larvae,
homolog of Drosophila gene - LLGL1, 144

Tyrosinase — TYR, 266 

Tyrosine hydroxylase — TH, 57, 60, 246, 392, 
393, 394 

Tyrosine kinase 2 - TYK2, 56, 60 

Tyrosine kinase STC — SRC, 56 

Tyrosine phosphatase PTP-BL related, Y-linked 
— PRY, 79 

U1 small nuclear RNA - RNU1, 62, 182

U2 small nuclear RNA — RNU2, 182, 410 

U22 host gene — U22, 111 

Ubiquitin A52 - UBA52, 26, 179, 401, 410

Ubiquitin A80 — UBA80, 179, 401 

Ubiquitin B — UBB, 26, 179, 410 

Ubiquitin C - UBC, 26, 179, 410 

Ubiquitin-activating enzyme — UBE1, 78, 79 


Ubiquitous TPR motif Y — UTY, 79 

UDP-galactose-4-epimerase — GALE, 15 

Upstream binding transcription factor - UBTF,
250

Uracil-DNA glycosylase - UNG, 272 

Urate oxidase — UOX, 280 

Urokinase — PLAU, 127, 191 

Uteroglobin — UGB, 228 

Vasoactive intestinal polypeptide/PHM27 — 
VIP, 25

Villin 1 - VIL1, 109

Villin 2 —- VIL2, 109 

Vimentin — VIM, 121 

Vitamin D receptor — VDR, 60, 189 

von Hippel-Lindau syndrome - VHL, 21 

von Willebrand factor - VWF, 267, 277

Wilms’ tumor protein — WT1, 30, 60, 113, 145, 
239, 240, 247 

Wingless 1, homolog of Drosophila gene — 
WNT1, 249

X (inactive)-specific transcript — XIST, 9, 30, 83 

Xg blood group — XG, 78, 80, 266 

XK-related, Y-linked - XKRY, 79 

XRCC5 DNA repair protein - XRCC5, 229

Y box family member 1 - YB1, 147

Yes proto-oncogene — YES1, 56 

Zinc finger, 11A - ZNF11A, 193, 389

Zinc finger, 11B — ZNF11B, 193, 390 

Zinc finger, 123 - ZNF123, 193 

Zinc finger, 125 — ZNF125, 193 

Zinc finger, 128 — ZNF128, 193 

Zinc finger, 129 — ZNF129, 193 

Zinc finger, 132 - ZNF132, 193 

Zinc finger, 134 —- ZNF134, 193 

Zinc finger, 135 - ZNF135, 193 

Zinc finger, 136 - ZNF136, 193 

Zinc finger, 137 - ZNF137, 193 

Zinc finger, 145 — ZNF145, 193 

Zinc finger, 146 — ZNF146, 193 

Zinc finger, 154 —- ZNF154, 193 

Zinc finger, 155 —- ZNF155, 193 

Zinc finger, 16 —- ZNF16, 193 

Zinc finger, 160 — ZNF160, 193 

Zinc finger, 165 — ZNF165, 193 

Zinc finger, 166 - ZNF166, 193 

Zinc finger, 167 — ZNF167, 193 

Zinc finger, 168 — ZNF168, 193 

Zinc finger, 173 - ZNF173, 193 

Zinc finger, 175 — ZNF175, 193 

Zinc finger, 204 — ZNF204, 193 

Zinc finger, 208 — ZNF208, 193, 396 

Zinc finger, 22 —- ZNF22, 193 

Zinc finger, 25 - ZNF25, 193, 389 

Zinc finger, 33A — ZNF33A, 193 

Zinc finger, 33B —- ZNF33B, 193, 390 

Zinc finger, 34 — ZNF34, 193 

Zinc finger, 35 —- ZNF35, 193 

Zinc finger, 37A — ZNF37A, 193, 389 

Zinc finger, 37B —- ZNF37B, 193, 390 

Zinc finger, 42 —- ZNF42, 193 

Zinc finger, 43 — ZNF43, 193 

Zinc finger, 45 —- ZNF45, 193 

Zinc finger, 52 - ZNF52, 193
Zinc finger, 56 - ZNF56, 193
Zinc finger, 58 — ZNF58, 193 
Zinc finger, 64 — ZNF64, 193 
Zinc finger, 66 — ZNF66, 193 
Zinc finger, 67 — ZNF67, 193 
Zinc finger, 69 — ZNF69, 193 
Zinc finger, 7 — ZNF7, 193 

Zinc finger, 70 — ZNF70, 193 


Zinc finger, 76 — ZNF76, 193 


Zinc finger, 80 — ZNF80, 239, 243, 358 


Zinc finger, 83 — ZNF83, 193 
Zinc finger, 85 — ZNF85, 193 
Zinc finger, 90 — ZNF90, 193 


Zinc finger, 91 — ZNF91, 141, 193, 251, 342 


Zinc finger, 92 - ZNF92, 193 
Zinc finger, 93 — ZNF93, 193 


Zinc finger, 71 — ZNF71, 193 
Zinc finger, 74 — ZNF74, 193 
Zinc finger, 75 — ZNF75, 272 


Zinc finger, X-linked — ZFX, 78, 84, 303, 412 
Zinc finger, Y-linked — ZFY, 79, 303, 412 


Gene symbols are as recommended by the Human Gene Nomenclature Committee
(http://www.gene.ucl.ac.uk/nomenclature/) at the time of going to press. Genes 
described by symbols in lower case letters have not yet been allocated a permanent 
symbol. 


Pseudogene symbols are not included. Symbols are sometimes abbreviated in the
text when a cluster of genes is being described, e.g. HOXA for the genes that
comprise the homeobox A cluster.


Index 


Accelerated evolution, see 
Evolution, accelerated 

Actins, 14, 15, 118, 140, 148, 
150-152, 223, 225 

Activated protein C resistance, 
88 

Albumin genes, 152, 348 

Allelic exclusion, 343 

Alternative splicing, see
Splicing, alternative

Alu sequence(s)
co-retrotranscription with
other elements, 34
consensus, 32
copy number, 31
evolution, 32, 340-341
function, 32
gene inactivation by
insertion, 342
human-specific, 32
in untranslated regions of
genes, 31, 342
incorporation by intron
sliding, 344
insertion as cause of
pathology, 342
insertion hotspots, 341, 342
internal regulatory elements,
32
intronic recombination,
108
location, chromosomal, 32,
341
location, intragenic, 31
mediating gene conversion
events, 277, 341, 411
mediating gene deletions, 37,
329, 341
mediating gene duplications,
164, 341
mediating gene fusions, 165,
341, 398
mediating recombination
events, 57, 108, 165, 404,
405
mediating transpositions,
341, 394
methylation, 32
modulation of enhancer
activity by, 240
non-random insertion, 341,
342
numbers in great ape
genomes, 341
origin, 32, 340-341
polymorphism, 342
provision of transcriptional
initiation sites by, 241
recruitment as promoter
elements, 238-241, 341
recruitment as silencer
elements, 240
regulatory elements within,
32
role in recombination of
chromosomal segments,
345, 404, 405
splice-mediated insertion,
342-343
structure, 32
sub-families, 32
transcription, 32
transpositionally competent,
32
within protein coding
regions, 31, 342
Amino acid substitution(s) 
chemical difference value, 
309 
effects on protein structure 
and function, 35, 36, 306 
factors determining clinical 
importance, 306-307 
factors determining 
evolutionary acceptability, 
307, 308-310 
rates, 298 
Amino acyl-tRNA synthetases, 
118, 119, 148, 178, 346, 
392-393, 398-399 
Ancient conserved regions, 144, 
145, 148 
Annexins, 114, 146, 225, 348, 
353 
Anticipation, 361 
Apolipoprotein(a), 236, 245, 
318, 357 
Apolipoprotein genes, 28, 110, 
116, 152-153, 224, 225, 
228-229, 236, 237, 238, 
300, 350 


Boundary elements, 22-23 
Branch point element, 24 


CCAAT box, 18, 111, 221, 228, 
230 
Cadherins, 183 
Caenorhabditis elegans 
(nematode), 247 
genes, 89 
genome sequencing, 89 
orthologues of human genes, 
144 
sequence database, 89 
Cambrian explosion, 149 


cAMP-responsive element, 223, 
225, 229, 230 
Centromere(s) 
centromere-specific protein 
CENP-B binding site, 4, 
338 
changes in location, 72 
inactivation, 73 
satellite DNA, see Satellite 
DNA, centromeric 
sequence exchange with 
telomeres, 396 
Centromere-specific protein 
CENP-B, 4 
evolution, 4 
potential to promote 
homologous 
recombination, 338 
transposase-encoded protein 
with cellular function, 
338 
Charcot-Marie-Tooth disease, 
345, 405 
Chi recombination element, 31, 
358, 411 
Chiasma(ta), 5 
frequency, 5 
Chimpanzee genome 
deletions/translocations, 140 
homology with human, 68, 
76, 140, 305 
Chromatin 
condensation, 3 
remodeling, 4, 20 
structure, 3—4, 27 
unfolding, 4 
Chromosome(s) 
arrangement on metaphase 
plate, 397 
bands, 5, 6, 7, 30, 32, 34 
evolution, see Chromosome 
evolution 
pairing, 4 
synteny, see Synteny 
number, primates, 73, 76 
structure, 3 see also 
Nucleosomal packing 
Chromosome evolution 
chromosome 21, 57, 74-75 
deletions, 72 
fissions, 72-73 
fusions, 72, 73 
in Old World primates, 
72-76 
in New World primates, 77 
insertions, 72 
interchromosomal 
differences in, 72 
intrachromosomal
duplications, 77
inversions, 65, 66, 72, 73, 74,
78, 80
translocations, 72, 73-74, 76,
77, 179, 193, 392-397
Chromosome heteromorphism,
76
Chromosome maps 
comparative mapping in 
primates, 72-77 
Chromosome painting, 73 
comparative (ZOO-FISH), 
74-75 
Chymases, 423-424
Cis-acting elements, see also 
Promoters, gene 
amplification, 229 
associated with bidirectional
promoters, 232
cAMP response element,
223, 225, 229, 230
CCAAT box, see CCAAT box
CpG islands, 231, 233
databases, see Promoters, 
sequence databases 
duplication, 238 
functional redundancy, 238 
importance of element 
spacing, 226 
initiator element, 18, 221, 
226, 227 
interactions with 
transcription factors, 253 
interferon-responsive 
element, 227 
negative regulatory, 19-20 
NFkB binding site, 225, 229 
octamer sequence element, 
226 
relative location, 
conservation of, 223-225 
retinoic acid response 
element, 238 
species-specific, 223 
TATA box, see TATA box 
Cloning, in silico, 89 
Codon usage
database, 304 
differences between 
chromosomal regions, 
304, 312 
differences between 
orthologous genes, 304 
differences between 
paralogous genes, 191 
DNA repair efficiency and, 
312 
evolution of, 304, 311-312 
inter-specific differences in, 
304, 311-312 
Coevolution 
antagonistic, 30 


between nuclear and 
mitochondrial genomes, 
82, 84 
cytokines/cytokine receptors, 
199 
fibroblast growth 
factors/receptors, 159, 199 
insulin/insulin-like growth 
factors/receptors, 198 
interleukins/interleukin 
receptors, 198 
keratins, types I and II, 169, 
198 
ligand-receptor pairs, 
198-199 
nuclear receptors/ligands, 
198 
owing to possession of 
shared bidirectional
promoters, 198 
rRNA subunits by 
compensatory slippage, 
363 
serine proteases/serine 
protease inhibitors, 191 
spectrins, 356 
subunits of heterodimeric 
proteins, 168, 198-199 
tRNAs/aminoacyl-tRNA 
synthetases, 178 
Cold shock domain proteins, 
197 
Collagen genes, 108, 109, 110, 
156-158, 231, 232 
Color blindness, 330 
Comparative mapping 
mammalian database, 63 
mouse-human, 65, 66 
primates, 72-77 
zebrafish-mammalian, 56-57 
Compensatory slippage, 363 
Complement genes, 109, 124, 
153-155, 331, 350, 351 
Concerted evolution, 150, 
355-356 
intergenic homogenization, 
355 
intragenic homogenization, 
355-356 
satellite DNA, see Satellite 
DNA 
Congenital malformations, 247 
Convergent evolution 
apolipoprotein(a), 357 
functional convergence, 197 
lysozyme, 301 
mechanistic convergence, 
197 
promoter elements, see 
Promoters, gene 
sequence convergence, 198 
structural convergence, 197 
Coordinate expression 


linked genes, 66, 168, 181, 
231, 247, 348 
unlinked genes, 247 
Copy DNA (cDNA), 8 
CpG dinucleotide, 26-29 
chromosomal differences in 
frequency, 313 
methylation-mediated 
deamination, 28, 29, 36, 
313, 314 
mutation hotspot in genes, 
28, 29, 36, 313 
mutation hotspot in 
pseudogenes, 28 
mutations in promoter 
elements, 226 
polymorphisms in, 12 
target for DNA methylation, 
26 
CpG island(s), 27-28 
association with 
bidirectional promoters, 
231 
chromosomal location, 6, 28 
erosion over evolutionary 
time, 28, 313 
genic location, 27-28 
role in modulating gene 
expression, 27-28 
CpG suppression, 28-29 
evolution, 28-29, 306 
Cryptic splice site utilization, 
see Splicing mutations, 
phenotypic consequences 
Crystallins, 16, 130, 155-156, 
185, 228, 231 
taxon-specific, 155, 231 
ubiquitous, 155 
Cystatins, 184 
Cytochrome P450 enzymes, 
183-184, 225, 350, 409, 
410 


Defensins, 129, 399-400 
Degeneracy of genetic code 
four-fold degenerate sites, 
300 
nondegenerate sites, 300 
relationship to nucleotide 
substitution rate, 298 
two-fold degenerate sites, 
300 
Deletions, gross 
exon, 353 
gene inactivating, 283, 
329-330 
hotspot, 338 
occurring during primate 
evolution, 283 
pathological, 37, 329 
polymorphism, 330-331 
recombination causing, 283, 
329 


Deletion hotspot consensus 
sequence, 331, 332 
Deletions, micro- 
combined with insertions, 
see Indels 
hotspots, 331, 332 
in-frame, 334 
inactivating genes during 
evolution, 13, 280, 281, 
332 
mediated by direct repeats, 
187, 331, 332-333 
mediated by inverted 
repeats, 187, 331, 333 
mediated by symmetric 
elements, 331 
pathological, 331-332 
relative frequency in 
evolution, 332, 334 
untranslated region 5’, 234 
within pseudogenes, 268 
Developmental control genes, 
247, 249 
functional redundancy of, 
252, 350 
Developmental switching 
evolution of, 152, 162, 
235 
gene silencing in, 235 
nucleotide changes required 
for, 162, 235 
Dictyostelium (slime mould), 
144, 146 
Disease susceptibility 
ABO blood groups and, 13, 
87 
HLA variation and, 86, 170, 
174 
infectious, 86, 87 
parasitic, 86, 87 
Divergent evolution, 150, 196 
DNA-DNA hybridization 
studies, 68 
DNA methylation, 26-30 
Amphioxus, 26 
coelenterate, 26 
echinoderm, 26 
evolution, 26, 28 
in sperm, 29 
insect, 26 
molluscan, 26 
pattern changes during 
development, 26-27 
pattern inheritance, 26-27 
pattern, inter-individual 
differences in, 27 
patterns, evolutionary 
stability, 29 
role in DNA 
replication/repair, 26 
role in gene regulation, 26 
role in imprinting, see 
Imprinting 


role in silencing expression 
of foreign DNA, 26 
role in X-inactivation, 313 
target specificity, 26 
vertebrate, 26-27, 28 
DNA polymerase genes, 148 
DNA polymerase arrest sites, 
331 
DNA polymorphism, see 
Polymorphism(s) 
DNA repair 
codon usage and, 312 
differences in efficiency 
between chromosomes, 67 
evolution, 149, 306-307 
gene density and, 6 
mutational bias and, 
306-307, 314 
reading frame sensitivity, 
314 
transcribed regions, 6, 306, 
314, 315 
DNA repair genes, 145, 147, 
149, 229 
DNA replication 
chromosomal bands, 
differences in timing 
between, 7 
genes and replication timing, 
7 
origins, see Origins of DNA 
replication 
proteins, 145, 147 
timing, 20, 31, 32 
units of, see Replicons 
DNA topoisomerase genes, 148 
DNAse I footprinting, 21, 
224-225 
Domain shuffling, 189 
Drosophila melanogaster (fruit 
fly), 247 
database, 143-144 
orthologues of human genes, 
143-144 
Duplications, gene 
accelerated evolution, 
subsequent to, see 
Evolution, accelerated 
causing genetic disease, 345 
change in expression 
consequent to, 61 
conservation of intron 
position subsequent to, 
346, 355 
creating multi-gene families, 
179, 346-347, 404 
during primate evolution, 
350 
functional redundancy 
consequent to, 251-252, 
350 
gene loss subsequent to, 150, 
162, 163, 166, 173, 180 
inter-chromosomal
translocation subsequent 
to, see Translocations 

inversions and, 231, 390, 391 

involvement of LINE 
elements in, 350 

leading to interspecific 
differences, 65 

leading to pseudogene 
formation, 61-63

polymorphism, 351-352 

positive selection, see 
Selection, positive 

promoter divergence 
subsequent to, 222 

recombination causing, 193, 
404 

redundancy consequent to, 
57-58, 61-63, 180, 231 

resolution of function 
switching, 302 

tandem, 159, 160 

variable propensity to 
undergo, 347 


Duplications, genome, 55-63,
142

cytogenetic evidence for, 55 

evidence from comparative 
gene number, 56 

evidence from comparative 
nuclear DNA 
measurement, 55 

evidence from distribution 
of paralogous genes, 
55-56, 159, 160, 166 

evidence from paralogous 
chromosomal regions, 
55-61 

fate of vertebrate genes after, 
62-63, 199 

genetic redundancy 
consequent to, 57-58, 
61-63 

timing, 56 

yeast, 61-62 


Duplications, intra-chromosomal, 57
genes involved, 57, 62, 160, 
346 
intracluster, 187, 194 
inverted, 186 
polymorphic, 352 
truncated gene copies, 
350-351 


Duplications, intragenic,
154

exonic, 156, 157, 318, 
353-355 

exonic, differences between 
orthologous genes, 193, 
354 

exonic, differences between 
paralogous genes, 354 
intra-exonic, 152, 156, 175,
355-356 
multi-exon, 352-353 
oligomer, 356 
polymorphisms, 357 
resulting in internal dyadic 
symmetry, 189, 191, 352 
successive, 352 
Duplicons, 329 
Dysmorphic syndromes, 247 


Ectopic transcripts, see mRNA 
transcription 
Editosome, 246 
Enhancer(s), 19, 20 
activity modulation by Alu 
sequence, 240 
association with matrix 
attachment regions, 7 
binding proteins, 20 
controlling sequences, see 
Boundary elements 
definition, 19 
downstream, 162 
function, 19 
inactivating mutations in, 
281 
intronic, 19 
locus control region, see 
Locus control region(s) 
sharing, 164, 166 
species-specific differences 
in intronic, 231 
Escherichia coli 
genes, 89 
genome sequencing, 89 
Estrogen response element 
(ERE), 229, 252, 
318-319 
Ets gene family, 144, 226-227, 
236 
Eukaryotes, evolution of, 89, 
107 
Evolution, accelerated 
post-duplicational, 58-59 
serine protease inhibitors, 
191, 192, 426 
serine proteases, 191, 426, 
432 
Evolution, concerted, see 
Concerted evolution 
Evolution, convergent, see 
Convergent evolution 
Evolution, directed, 372 
Evolution, discontinuous, 302 
Evolution, divergent, see 
Divergent evolution 
Evolution, horizontal, see 
Concerted evolution 
Evolution, morphological, see 
Morphological evolution 
Evolution of cellular processes, 
221 


Exon(s), 8 


amplification, 156, 158, 353 

average number, vertebrate 
gene, 116 

classificatory scheme, 116, 
117 

creation, 120, 121, 122, 123, 
353 

deletion, 121 

duplication, see 
Duplications, intragenic 

elongation, 164, 189 

fusion, 158 

large, 8, 175 

loss, 130 

maximum size, 116 

minimum size, 116 

non-symmetrical, 124 

number by gene, 116 

primordial, 129 

pseudoexons, see 
Pseudoexons 

relative frequency of 
different types, 116 

scrambling, see Exon 
scrambling 

shuffling, see Exon shuffling 

silencing, 130 

size, 116 

symmetrical, 124, 125 

truncation, 165 


Exon duplication/amplification,
see Duplications, intragenic


Exon scrambling, 403 
Exon shuffling, 122-130, 352 


evolutionary emergence, 123, 
130, 171 

human genes, 123, 125-128, 
424, 425 

in transcription factor genes, 
254 

intron-mediated 
recombination in, 123, 
406 

LINE element-mediated 
recombination, 123 

modularization, 124, 125 

protein domains, see Domain 
shuffling 

rules, 124 

use of symmetrical exons, 
124 


Exon skipping, see Splicing
mutations, phenotypic
consequences


“Exon theory of genes”, 116,
118, 119, 125


Expressed sequence tags
(ESTs), 9, 89


Factor V Leiden, 88 


Fibroblast growth factor 
receptors, 16, 24, 158-159 
Fission, chromosomal, 72-73, 
74 
Fitness, 12, 297 
Fixation, 14, 297 
Fossil DNA, see Gene(s), fossil 
Fragile sites, 361, 362 
Fugu rubripes (pufferfish), 88 
alternative splicing, 112 
intron size, 108-109 
paucity of repetitive 
sequence, 108 
Function switching, 302, 349 
Fusion 
chromosomal, 77 
exonic, 158 
gene, see Gene fusion 
telomere, see Telomere, 
fusion 
transcriptional, see Fusion 
splicing 
X chromosome-autosome, 77 
Fusion splicing, 26, 243, 402 


GABA receptors, 159-160 
G-protein-coupled receptors, 
160, 184-185, 228 

G-protein genes, 160 
G:T mismatch repair, 313 
Gel retardation analysis, 21, 
236 
GenBank, 9 
Gene(s) 
3’ untranslated regions, see 
Untranslated region 3’ 
5’ untranslated regions, see 
Untranslated region 5’ 
clustering, 168, 247, 348 
definitions, 9-10, 178 
density, 6 
developmental control, 247, 
249 
distribution by 
chromosomal bands, 6, 
32 
expression, see Gene 
expression 
fossil, 423 
functional organization, 
14-17 
functional redundancy of, 
251-252, 350 
functions of human, 10 
fusion, see Gene fusion 
fusion splicing, see Fusion 
splicing 
highly conserved, 300 
housekeeping, 6, 79, 270 
human-chimpanzee 
homologies, 68 
human-Drosophila 
homologies, 143-144 


human-mouse homologies, 
67 

human-nematode 
homologies, 89, 144 

human-yeast homologies, 
145 

inactivation, see Gene 
inactivation 

in search of a function6, 351 

intronless, 8, 107-108, 187,
189, 193, 412 

linear order and expression, 
16-17 

loss, 150, 162, 163, 166, 168, 
173, 180, 279-284 

nomenclature 

non-essential, 280 

number, 3, 7 

origins, see Genes, origins of 
human 

overlapping, 9-10 

primordial, 117, 119, 356 

polycistronic, 25-26 

polyprotein, 26 

pseudoautosomal, 80-81 

rapidly evolving, 78, 152, 
246, 298, 300 

retrotransposed, 178, 338, 
403, 411 

RNA-encoding, 9, 30 

structure, archetypal, 8 

tissue-specific, 6, 27 

truncated, 350-351 

within introns, 9, 110 


Gene amplification, 141, 174,
196, 301


Gene conversion 


as a cause of disease, 277 

definition, 276, 408 

frequency, 174 

haplotype diversity and, 13 

histone repeat 
homogenization by, 181 

in human gene evolution, 61, 
150, 162, 165, 169, 174, 
181, 182, 187, 335, 
408-411 

in triplet repeat expansion, 
365 

inter-chromosomal, 182, 277, 
397, 410-411 

inter-specific differences in, 
276 

internal repeat 
homogenization by, 179, 
355 

intra-chromosomal, 182, 277, 
408, 410 

intron sequence, 
homogenization by, 110 

length of sequence involved, 
277-278, 411 

mechanism of, 411 


minisatellite DNA, see 
Minisatellite DNA 

multigene family, 150 

of functional genes by 
pseudogenes, 277, 278 

of pseudogenes by functional 
genes, 279 

promoted by synteny, 66 

promoter regions, 222, 410 

promoting sequence 
diversification, 61, 411 

promoting sequence 
homogenization, 61, 161, 
173, 181, 182, 411 

pseudogene-mediated, 
277-278 

relative absence, 61 

resulting in gene 
inactivation, 277 

resulting in pseudogene 
reactivation, 279 

sequences potentially 
involved in promoting, 
411 

sequences potentially 
inhibiting, 411 


Gene creation, see
Retrotransposition
Gene duplication, see
Duplications, gene


Gene expression 


coordinate, see Coordinate 
expression 

developmental changes in, 
235-236 

differences between 
orthologous genes, 
228-231 

differences between 
paralogous genes, 173, 
225-228 

in embryonic development, 
16-17 

levels of control of, 10 

networks, 253 

nucleotide sequence changes 
responsible for differences 
in, 225-228, 229-230 

pathway, 11 

post-transcriptional 
regulation, 10 

promoter polymorphisms 
affecting, 236-237 

role of Alu repeat sequences, 
238-241, 341 

role of gene order in, 16-17 

tissue-specific, 10 


Gene families 


actins, 14, 15, 118, 140, 148, 
150-152, 223, 225 

albumin, 152, 348 

alcohol dehydrogenases, 16, 
17, 118, 141, 234 
aminoacyl-tRNA
synthetases, 177-178 see 
also Amino acyl-tRNA 
synthetases 

amylases, 14-15, 140, 242, 
330-331, 339, 390-391, 
404, 408 

apolipoprotein, 152-153 see 
also Apolipoprotein genes 

birth-and-death of, 150, 195, 
346, 396 

collagens, 108, 109, 110, 
156-158, 231, 232 

complement proteins, 109, 
124, 153-155, 331, 350, 
351 

concerted evolution, 150 

crystallins, 155-156 

defensins, 129, 399-400 

definition, 150 

divergent evolution, 150 

fibroblast growth factor 
receptors, 16, 24, 158-159 

fibroblast growth factors, 16, 
21, 158-159, 390 

G proteins, 160 

GABA receptors, 159-160 

globins, 15, 16, 20, 160-162, 
163, 223, 234, 235-236, 
247, 331, 408, 409 

glycophorins, 130, 164-165, 
166, 321, 335, 350, 404, 
405, 408 

growth hormone, 16, 20, 141, 
162, 164, 165, 251, 268, 
278, 302, 320-321, 404, 
410 

histones, 14, 150, 180-181, 
231 

homeobox, 14, 17, 56, 
165-168, 247, 249, 347 

integrins, 168-169 

keratins, 148, 168-170, 199 

lipases, 122, 144, 223, 236 

major histocompatibility 
complex, 13-14, 86, 
170-175 see also Major 
histocompatibility 
complex 

mechanisms of evolution, 
150 

mucins, 17, 116, 175, 176, 
357, 372 

ribosomal RNA, 9, 14, 150, 
181-182, 186, 360, 363, 
410, 423 

RNA-binding proteins, 
175-176, 177 

small nuclear RNA, 182-183 

sulfatases, 176 

tRNA, 177-178 

ubiquitins, 26, 144, 178-179 


Gene fusion 


causing inherited disease, 
397-398 

during evolution, 25-26, 162, 
165, 179, 189, 192, 230, 
398-401 

internal methionines as 
evidence for, 401-402

polymorphism, see 
Polymorphism, fusion 
genes 

sequences mediating, 165, 
398 

somatic as cause of 
tumorigenesis, 397 


Gene inactivation 


α-1,3-galactosyltransferase,
280-281 

α-fertilin, 282-283

ADP-ribosyltransferase 1, 
282 

by excessive production of 
aberrant transcripts, 115 

by gene conversion, see Gene 
conversion 

chorionic 
somatomammotropin 
pseudogene, 320 

CMP-N-acetylneuraminic 
acid hydroxylase, 283 

γ-crystallin, 155

during human evolution, 
282, 283 

during primate evolution, 
280, 282 

elastase I, 281 

flavin-containing
monooxygenase 2, 283-284

haptoglobin, 282 

in humans, 282, 283 

in mammals, 162 

in Old World monkeys, 162, 
280-281 

L-gulono-γ-lactone oxidase,
281-282 

T-cell receptor γV10, 283

urate oxidase, 280 


Gene loss, 150, 162, 163, 166,
168, 173, 180, 279-284


Gene nurseries, 193, 351 
Gene order 


relationship to order of 
activation during 
ontogeny, 16-17, 161, 162 


Genes, origin of human 


eukaryote-specific, 145-146, 
160, 168, 178, 184, 190, 
191 

human-specific, 139-140, 
195 

mammalian-specific, 
141-142, 149, 195, 349, 
357 




metazoan-specific, 143-144, 
158, 166, 189, 254 
primate-specific, 140-142, 
149, 164, 171, 187, 390 
universal, 146-149, 175, 178, 
180, 185 
vertebrate-specific, 143, 149, 
153, 173, 181, 184, 195, 
196 
Gene recruitment, 155 
Gene regulatory networks, 222, 
253 
Gene sharing, see Gene 
recruitment 
Gene superfamilies 
cadherins, 183 
cystatins, 184 
cytochrome P450 enzymes, 
183-184, 225, 350, 409, 
410 
definition, 183 
G-protein-coupled receptors, 
108, 160, 184-185 
heat shock proteins, 110, 
148, 155, 185 
immunoglobulins, 10, 14, 15, 
139, 150, 194-196, 226, 
346, 352 
insulin and insulin-like 
growth factors, 143, 186, 
198, 393 
interferons, 16, 108, 141, 
186-189, 408 
nuclear receptors, 189-190, 
252-253, 318-319 
olfactory receptors, 143, 174, 
193-194, 266, 319, 347, 351 
protein kinase C, 190 
serine proteases, 16, 
125-128, 190-191, 
424-436 
serpins (serine protease 
inhibitors), 191-192, 426 
T cell receptors, 10, 16, 196, 
331 
zinc finger proteins, 141, 
145, 192-193, 243, 251 
Genetic conflict hypotheses, see 
Imprinting 
Genetic code 
codon usage, see Codon 
usage 
degeneracy, see Degeneracy 
of genetic code 
evolution of, 304 
Genetic diversity 
chimpanzee, 76 
human, 85 
role of population size in 
determining, 85, 174 
role of population 
structure/dynamics in 
determining, 85, 174 


Genetic drift, 12, 297, 315, 330, 
334, 364 
Genome duplications, see 
Duplications, genome 
Genome mapping projects 
human, 89 
mammalian, 63, 65 
model organism, 88-89 
Genome size(s) 
Escherichia coli, 89 
nematode, 89 
primates, 76 
yeast, 89 
Glycophorins, 130, 164-165, 
166, 321, 335, 350, 404, 
405, 408 
Glycosylation sites 
differences between 
orthologous genes, 354 
Globins, see Gene families, 
globins 
Group selection, 317 
Growth hormone, see Gene 
families, growth hormone 
Growth regulation, 30, 68, 186 


Hemophilia, 391 
Haplotype(s), see 
Polymorphism(s) 
Heat shock protein genes, 108, 
110, 148, 155, 185, 249, 
346 
Helicase genes, 148 
Heterochromatin, 31, 32, 72, 
74, 75, 77
Heterozygote advantage, see 
Overdominant selection 
Hexokinases, 146, 147 
High mobility group (HMG) 
proteins, 146, 221, 250 
Histone genes, 14, 180-181, 231 
main-type, 180-181, 228 
replacement, 180-181 
Histone fold, 221 
Hitchhiker effect, see 
Polymorphism(s) 
Homeobox protein genes, 14, 
17, 56, 165-168, 247, 249, 
347 
Hominoid slowdown, 71-72, 
301 
Homology modelling, 424 
Horizontal evolution, see 
Concerted evolution 
Human-chimpanzee 
homologies, 68, 76, 140, 
305 
Human-rodent homologies, 67 
Human evolution 
cognitive ability, 88 
endurance, 88 
genetic diversity, 85 
hemostatic balance, 87-88


hominoid slowdown, 71-72, 
301 
multiregional model, 84 
origins of modern humans, 
84-85 
out of Africa model, 84, 85 
pharmacogenetic variation, 
330, 335 
population bottlenecks, 84, 
85 
population growth, 85, 86 
population movements, 84, 
85 
role of natural selection, 
86-88 
skin pigmentation, 88 
Human Gene Mutation Database, 
35, 308 
Human Gene Nomenclature 
Committee, ix 
Human genome 
gene map, 89 
gene number, 3, 7
size, 3 
Human genome sequencing 
project, 89 
Human immunodeficiency 
virus (HIV) 
genetic variants that restrict 
infection, 86-87 


Immunoglobulin genes, see 
Gene superfamilies
Imprinting, 29-30 
definition, 29 
DNA methylation, role of, 
26 
evolution, 30 
genes, imprinted, 
conservation of 
expression, 30 
genes, imprinted, function 
of, 30 
genes, imprinted, number of, 
30 
genes, imprinted, rate of 
evolution of, 30 
genetic conflict hypothesis, 
30 
polymorphism, 30 
Indels, 335-336 
Initiator element, 18, 221, 226, 
227 
Insertions 
differences between 
orthologous genes, 110 
gene inactivating, 37, 339 
introns as lightning 
conductors for, 111 
Insertion, splice-mediated, 
342-343 
Insertion, transposable 
elements, 336-344 


Alu sequences, 238-241, 
342 
endogenous retroviral 
sequences, 241-244 
LINES, 244-255, 339-340 
retroposons, 337 
SINES, 340-342 
transposons, 338-339
Insertions, micro- 
coding region, 164 
combined with deletions, see 
Indels 
hotspot, 335 
in orthologous genes, 
334-335 
in paralogous genes, 334-335 
mediated by direct repeats, 
335 
pathological, 334 
polymorphisms, see 
Polymorphisms, micro- 
insertion 
promoter, 236, 335 
relative frequency, 334 
Insulators, see Boundary 
elements 
Insulin genes, 186, 198 
Insulin-like growth factor 
genes, 186, 198 
Integrin genes, 168-169 
Interferon-responsive element, 
227 
Interferon genes, 16, 108, 141, 
186-189, 408 
Intermediate filament proteins, 
144, 168 
Intracisternal A particle, 244 
Intron(s) 
alternatively spliced, see 
Splicing, alternative 
Alu sequence insertion into, 
108 
archaebacterial, 107 
AT-AC, non-canonical, 23 
classification, 8 
conservation of location in 
orthologous genes, 109, 
118, 355 
conservation of location in 
paralogous genes, 109, 
151, 156, 157, 164, 
183-184 
conservation of size between 
orthologous genes, 
108-109 
creation through retroviral 
LTR insertion, 121 
deletion, 118, 119, 120 
density and phylogenetic 
complexity, 107 
differences in location 
between paralogous genes, 
118 
differences in number
between orthologous 
genes, 120, 122 

encoding functional 
products, 111 

enhancers within, 19, 110 

excision mediated by V(D)J 
recombinase, 122 

gains and losses in 
paralogous genes, 151, 152 

genes within, 9, 110 

human genes lacking, see 
Gene(s), intronless 

insertion, 117, 118, 119, 120, 
121, 122, 152, 184 

insertion into coding 
sequence, 121 

location and association 
with protein domain 
distribution, 118, 119, 
120 

minimum size, 108 

negative regulatory elements 
in, 19, 110 

of imprinted genes, 30 

origin, 107 

phase, 120, 123-124 

phase compatibility, 
123-125 

promoter elements in, 110 

recombination between, 109, 
156, 399 

repressor elements in, 
110-111 

removal through 
retrotransposition, 108, 
121 

role as lightning conductors 
for retrotranspositional 
insertion, 111 

role in recombination 
suppression, 254, 406 

self splicing group II, 107 

sequence conservation, 
109-110 

sequence homogenization by 
gene conversion, see Gene 
conversion 

sequences affecting mRNA 
processing, 111 

sequences affecting 
nucleosome formation, 
111 

single base-pair 
substitutions in, 298, 299 

size, 8, 108 

size and phylogenetic 
complexity, 107 

size in Fugu rubripes, 
108-109 

size variation in human 
genes, 8 

sliding, see Intron sliding 
species-specific differences
in enhancers, see 
Enhancers 
Intron-mediated 
recombination, see Exon 
shuffling 
“Introns early” theory, see Exon
theory of genes
“Introns late” theory, 117, 118,
119, 122
Intron sliding, 118, 184, 344, 
345 
Inversions, chromosomal, 
389-392 
duplications and, 231 
hotspots in pathology and 
evolution, 72, 391, 392 
human-specific, 390, 392 
intragenic, 391 
mediated by LINE elements, 
390 
of entire gene sequences, 390 
paracentric, 65, 78, 389, 
390-391, 392 
pathological, 391, 392 
pericentric, 65, 72, 80, 152, 
389-390, 392 
physiological, 391 
polymorphic, 390 
somatic, 391 
Involucrin, 16, 367-373 
evolution in primates, 
368-373 
inter-specific differences, 
368-373 
polymorphism, 370 
protein, 367 
Iron response element, 21 
Isochores, 6 
evolutionary origin, 6 
families, 6 
Isozymes, evolution of, 349 


Keratin genes, 148, 168-170, 
199 
Kinetochore, 4 


Lactalbumin, α-, 142, 349
Lactation, 142, 349 
Ligand promiscuity, 199 
LINE (long interspersed 
repeat) elements 
acquisition of promoters, 
244-245 
chromosomal location, 34 
co-retrotransposition with 
other elements, 34 
consensus, 33 
copy number, 33 
evolution, 33 
expression, 34 
gene inactivation by 
insertion, 339 


insertion within genes, 123 
involvement in 
translocations, 77 
location, chromosomal, 34, 
339 
mediated transduction of 
exons, 114, 123, 340 
mediated transduction of 
promoter elements, 
244-245 
mediating gene conversion, 
277 
mediated gene duplication, 
162, 350 
mediated inversion, 390 
non-random insertion, 339, 
340 
numbers in great ape 
genomes, 339 
open reading frames, 34 
origin, 33-34 
phylogenetic reconstruction, 
245 
polyadenylation site, 
provision of, 245 
promoter, internal, 34 
recruitment as promoter 
elements, 244-245, 340 
retrotransposition, 34 
reverse transcriptase activity, 
34 
role in promoting 
recombination, 57, 340 
structure, 33 
transcription, 34 
Lipases, 122, 144, 223, 236 
Locus control region(s), 10, 16, 
20, 161, 268 
associated boundary 
element, 22 
endogenous retroviral 
element within, 244 
evolutionary conservation, 
20, 224-225 
Lysozyme, 300-301, 349 


Major histocompatibility 
complex 
evolution, 170-175 
limited class II variability in 
marmoset, 175 
negative selection, 174 
overdominant selection, 14, 
86, 174 
polymorphism, 86, 170, 
174-175 
polymorphism, trans-species,
13-14
positive selection, 174 
Mammalian evolution 
adaptive radiation, 63, 64 
conservation of synteny, 63 


molecular clocks, see 
Molecular clocks 
phylogenies, 63-64
role of karyotypic change, 63 
Mammalian-wide interspersed 
repeat (MIR) elements, 
33, 245 
Matrix attachment region(s), 
6-7 
association with enhancer 
elements, 7 
association with origins of 
replication, 7 
association with 
topoisomerase cleavage 
sites, 7 
binding proteins, 7 
gene expression and, 6, 7 
role as boundary elements, 
22 
sequence characteristics, 7 
Matrix metalloproteinases, 146, 
348 
Maximum parsimony 
principle, 423, 424 
Megasatellite DNA, 31 
Meiotic drive, see Segregation 
distortion 
Metabolic pathways, evolution 
of, 15 
Methylation-mediated 
deamination, see CpG 
dinucleotide 
Myb gene family, 146 
Micro-deletion(s), see 
Deletions, micro- 
Microsatellite DNA, 31 
conservation of location at 
orthologous locations, 31, 
76, 246, 359 
diversity, 359 
evolutionary conservation, 
76, 245-246, 359 
insertion of, 108 
mutation rate, 359 
polymorphism, see 
Polymorphism, 
microsatellite 
recruitment as promoter 
elements, 245-246 
role in promoting gene 
conversion, 411 
utility in study of human 
populations, 85, 359 
Minisatellite DNA, 31 
binding proteins, 245 
expansion in progressive 
myoclonus epilepsy, 361, 
362 
gene conversion, 359 
inter-specific differences in 
repeat length, 5 
mutation rate, 358 


polymorphism, see 
Polymorphism, 
minisatellite 
recruitment as promoter 
elements, 245-246 
role in promoting 
recombinational 
instability, 6, 359 
telomeric, 4, 31 
Misalignment mutagenesis, 
314, 360 
Mismatch repair genes, 147 
Mitochondrial genome 
co-evolution with nuclear 
genome, 82, 84 
evolutionary origin, 82 
genes, 81 
mutation rate, 81-82 
Neanderthal, 84 
size, 81 
transfer of sequences to 
nuclear genome, 82 
variation, 84 
Mobility shift assays, see Gel 
retardation analysis 
Molecular clock(s) 
confounding effect of gene 
conversion, 164 
CpG deamination rate as, 
313 
hypothesis, 298 
in vertebrate evolution, 63 
Molecular reconstruction, see 
Phylogenetic 
reconstruction 
mRNA(s) 
aberrant transcripts, see 
mRNA transcription 
ectopic transcripts, see 
mRNA transcription 
effect of mutation on 
translation efficiency, 35,
234 
effect of mutation on 
stability, 35, 234 
illegitimate, see mRNA, 
ectopic transcripts 
intracellular localization, 22 
non-polyadenylated, 180 
nonsense, 115 
nucleocytoplasmic transfer, 
22 
semi-processed, 121 
size, 9 
surveillance, 115 
mRNA antisense transcripts, 
270 
mRNA editing 
enzyme, apobec-1, 246 
evolution, 246 
human genes exhibiting, 
246, 247 
species comparison, 246 


mRNA export, 22, 233 
mRNA splice sites, 23 
consensus sequences, 23 
evolution, 23, 24 
non-canonical, 23 
similarity to CAG repeats, 
363 
mRNA splicing, 8, 23 
alternative, see Splicing, 
alternative 
fusion splicing, see Fusion 
splicing 
mechanism, 23 
trans-splicing, see Trans- 
splicing 
mRNA stability, 22, 155, 233, 
234 
mRNA surveillance, 115 
mRNA transcription, 8 
aberrant, 115 
cotranscription, see Genes, 
polycistronic 
ectopic, 115-116, 270 
factors, see Transcription 
factors 


initiation, see Transcriptional
initiation sites
readthrough, 240, 402 
relationship with mutation, 
35, 234 
mRNA translation, 10, 11, 21, 
22, 233 
Morphological evolution, 
167-168, 247, 249 
Mucins, 17, 175, 176, 357, 372 
Muller’s ratchet, 79 
Multigene families, see Gene 
families 
Mutation 
hotspots, 28, 29, 36, 303 
recurrent, 14, 29 
somatic, 303 
transcription, see 
Transcription 
Mutation rates 
CpG deamination, 313 
differences between 
chromosomes, 67 
differences between 
orthologous genes, 426 
differences between 
transcribed and 
nontranscribed strands, 
314 
human deleterious, 304-305
in pseudogenes, see 
Pseudogenes, processed 
inter-species comparisons,
301
male-female ratio, 303
natural selection and, 304,
305, 307
nearest neighbor-dependent,
303, 426
relationship to duplex
thermodynamic stability,
315
strand bias, 29, 314-315
trade-off theory, 304
X chromosome, 303, 304
Y chromosome, 303
Mutations, pathological
classificatory system, 35
deletions, gross, 35, 37
deletions, micro-, 35, 37
dominant negative, 35
duplications, see
Duplications, gene,
causing genetic disease
dynamic, 361
effects on protein structure,
36, 307-310
gain of function, 35, 361
gene conversion, see Gene
conversion
insertions, 35, 37, 342
inversions, 35, 37, 391, 392
loss of function, 35
methylation-mediated
deamination, see CpG
dinucleotides
missense, 35, 36
neighboring nucleotide
effect, 303, 314
nonsense, 35, 36
parallel with mutations
occurring during
evolution, 305-306
promoter, 35, 37
recent origin, 305
single base-pair
substitutions, coding
region, 35, 36
splicing, see Splicing
mutations
strand bias, 29, 314-315
triplet repeat expansions, see
Triplet repeat expansion,
disorders of
Myosins, 148


Natural selection in human
populations
alcohol
sensitivity/avoidance, 88
endurance, 88
hemostatic balance, 87-88
pathogen-driven, 86-87,
174


Negative regulatory elements, 
see Repressors, Silencers 

Neurotransmitters, 184 

Neutral theory of molecular 
evolution, 12, 297-298 
Nonsense-mediated mRNA
decay, see mRNA 
surveillance 

Nonsense mutation(s) 

gene inactivating in 
evolution, 280, 283, 315 

gene inactivating in 
pathology, see Mutations, 
pathological 

protein truncating in 
evolution, 164, 358 

tRNA suppressor, 178 

Nuclear receptors, 189-190, 
252-253 

Nucleosomal packing, 3 

association with 
transcriptional regulation, 
180, 221 

Nucleosomes, 3 

Nucleotide substitutions 

conservative, 299, 308 

in noncoding DNA, 305 

in pathology and evolution, 
305-307, 308-310 

nearest neighbor effects, 314 

neutral rate of, 299 

nonconservative, 299, 308 

nonsynonymous, 299, 300, 
301, 304, 305 

silent, 66, 306 

synonymous, 299, 300, 301, 
303, 304, 312 

thermodynamic stability 
and, 315 

transitions, 300 

transversions, 300 

Y chromosome, see Y 
chromosome 

Nucleotide substitution rates 

CpG deamination, 313 

coding regions, 298 

discontinuous, see Function 
switching 

hominoid slowdown, see 
Hominoid slowdown 

in primates, 71 

neutral rate, 299 

non-coding regions, 298 

non-synonymous, 299, 300, 
301, 304, 305 

pseudogenes, 276, 298, 304, 
306 

relative rate test, 301 

silent, since human-rodent 
divergence, 66 

synonymous, 299, 300, 301, 
303, 304, 312 

Y chromosome, see Y 
chromosome 


Odorant discrimination, 68, 
193-194 
Ohno’s law, 78 


Olfactory receptor genes, 143, 
174, 193-194, 266, 319, 
347, 351 
Opsins, see Visual pigment 
evolution 
Origins of DNA replication 
association with CpG 
islands, 7 
association with matrix 
attachment regions, 7 
consensus sequence, 7 
Orphan receptors, 189 
Orphon genes, 195 
Orthologous genes 
codon usage, differences 
between, 304 
conservation of intron 
location between, 109, 118 
conservation of intron size 
between, 108-109 
definition, 55 
differences in intron number 
between, 120 
differences in intron size 
between, 108 
differences in mutation rates 
between, 301, 426 
differences in promoter 
sequences, see Promoters, 
gene 
differences in promoter 
selection between, 
130-131 
differential usage of 
transcription initiation 
sites between, 235 
exon duplications, 
differences between, 193, 
354 
gene expression, differences 
between, 228-231 
homology exhibited by 
rodent and human genes, 
67 
length difference in 3’ 
untranslated regions, 234 
micro-deletions in, 333 
micro-insertions in, 334-335
splicing differences between, 
113-114, 321, 403 
translational initiation 
codons, differences 
between, 358 
untranslated regions, length 
differences between, 234 
Overdominant selection 
ABO blood group, 14, 87 
cystic fibrosis, 86 
major histocompatibility 
complex, see Major 
histocompatibility 
complex 
polymorphisms, 14 


sickle cell disease, 86 

tri-allelic opsin genes in 
New World monkeys, 
317-318 


Paired box domain (PAX) 
proteins, 144, 249 
Paralogous genes 
codon usage, differences 
between, 191 
conservation of intron 
location between, 109, 151 
definition, 55, 346 
differences in alternative 
splicing between, see 
Splicing, alternative 
differences in intron location 
between, 118 
differential usage of 
transcription initiation 
sites between, 227 
distribution, 56-61 
evidence for common 
ancestry, 346 
evidence for genome 
duplication from, 55-56, 
57-61, 159, 160, 166 
exonic duplications, 
differences between, 354 
gene expression differences 
between, 173, 225-228 
intron gains and losses in,
151, 152
micro-deletions in, 333 
micro-insertions in, 334-335 
origin by duplication, 
346 
origin by intra-chromosomal 
regional duplication, 57, 
346 
origin by tetraploidization, 
see Duplications, genome 
promoter sequence 
differences between, 162, 
180, 225-228 
splicing differences between, 
114, 320, 322 
untranslated regions, length 
differences in, 154, 234 
Paralogy, chromosomal 
regions 
evidence for genome 
duplication, see 
Duplications, genome 
Peptides, 184 
Pericentromeric region 
genes localized to, 394-396 
instability of, 390, 396 
transpositions associated 
with, 394-396 
Phylogenetic footprinting, 
223-225 
differential, 223-224, 236 


studies on human gene 
promoters, 223 
Phylogenetic reconstruction 
coding sequence, 423, 424 
LINE elements, 245 
promoter sequence, 224, 236 
Polyadenylation site(s), 
21-22 
alternative usage, 22, 114 
binding proteins, 22, 176 
contributed by LINE elements, 
245 
contributed by MIR elements, 
245 
contributed by retroviral 
elements, 244 
mutations, 234 
readthrough, 123 
substitution within, 35 
within coding sequence, 116 
Polymorphism(s) 
ABO blood group, 13 
active/inactive gene, 283, 
330-331 
age, 14 
Alu sequence 
presence/absence, 342 
balanced, 12, 14 
chromosomal, see 
Chromosomal 
heteromorphism 
coding sequence, 12, 13 
CpG dinucleotide, 12 
databases, 12 
definition, 11 
deletion, gross gene, 330-331 
deletion/insertion, 236 
deletional, 13 
duplicational, gene, 331, 
351-352 
duplicational, intragenic, 
357, 370-371 
Factor V Leiden, 88 
factors affecting allele 
frequency, 12, 14 
frequency in human gene 
regions, 13 
frequency in human genome, 
12, 299 
fusion genes, 401, 402 
gene copy number, 331, 
351-352, 404 
generated by gene 
conversion, 174 
haplotypes, 13, 175 


hitchhiker effect, 12, 174

HLA, 13, 170, 174-175

imprinting, 30

inter-chromosomal, 397

inversion, 390

length, 236-237

level and population size,
298

location, 12

major histocompatibility
complex, 13-14, 86, 170,
174-175

mechanisms of maintenance,
12, 14

megasatellite, 31

micro-deletion, 13, 236

micro-insertion, 13, 236, 335

microsatellite, 246

minisatellite, 358-359

missense, 12, 87, 88

mitochondrial, 84

neutralist model, 12, 298

nonsense, 12

null variant, 12, 13, 408-409

pathologically significant, 12

pericentric inversion, 390

population differences in
allele frequency, 13

private, 13

promoter, 12, 236-237, 246

protein, 85

pseudogene
presence/absence, 269, 274

reactivated pseudogene,
presence/absence, 278

recurrent mutation, 14

relationship to population
size, 298

restriction fragment length
(RFLPs), 11

retroviral LTR, 244

rRNA gene copy number,
182

selectively advantageous,
86-88

single nucleotide (SNPs), 12,
14

telomere length, 5, 360

telomeric repeat number, 5,
360

trans-species, 13-14, 87, 174,
237, 246

transient, 12

triplet repeat, 360-363

types, 12

ubiquitin repeat copy
number, 179

variable gene number, 331,
351-352, 404

xenobiotic, 330, 335, 408,
409

Y chromosome, see Y
chromosome


Polyproteins, see Proteins 
Primate classification, 70 
Primate evolution 
adaptation, 67-68
adaptive radiation, 68, 69
chromosome number, 73, 76
chromosomal, 72-77
divergence times, 68-72
evolutionary longevity of
different groups, 69 

fossil record, 68 

mutation rates, 71-72 

origin, 67, 68, 69 

phylogeny, 68-71 

relative genome sizes, 76 


Prolactin, 302 
Promoter(s), gene 


alternative, 17, 18, 230 
alternative use, differences
between orthologous
genes, 130-131

Alu sequences, recruitment 
of, 238-241 

amplification of elements 
within, 229, 253 

bi-directional, paralogous 
genes, 231-233, 247, 391 

bidirectional, unrelated 
genes, 233 

cis-acting elements, 18, 19, 
see also Cis-acting 
elements 

constitutive elements, 18 

convergent evolution, 230 

deletions, 121, 249 

differences between 
orthologous genes, 223, 
224, 243, 244 

differences between 
paralogous genes, 162, 
180, 225-228 

duplication, 238 

endogenous retroviral 
elements, recruitment of, 
34, 121, 339, 399 

enhancers, see Enhancers 

functional redundancy, 238 

gene conversion, 222, 278, 
410 

heterologous fusion during 
evolution, 155, 230 

inactivation during 
evolution, 281 

intronic, 17-18, 231

in vitro expression of site- 
directed mutants, 224 

LINE elements, recruitment 
of, by, 244-245, 340 

methylation-mediated 
deamination in, 226 

microsatellites, recruitment 
of as modulators of gene 
expression, 245-246 

minisatellites, recruitment of 
as modulators of gene 
expression, 245-246 

modularity, 222 

motif amplification, 229, 
253 

mutations, pathological, see
Mutations, pathological

negative regulatory
elements, see Repressors, 
Silencers 

phylogenetic footprinting, 
see Phylogenetic 
footprinting 

polymorphisms, see 
Polymorphisms, promoter 

recruitment, 8 

recruitment of pseudogene 
as, 242 

replacement, 32, 230, 244, 
399 

response elements, 18, 223, 
225, 227, 229, 230, 238, 
246, 318-319 

selection, inter-specific 
differences in, 234-235 

sequence databases, 222 

sharing, 156 

shuffling, 222, 253 

single base-pair 
substitutions in evolution, 
162, 180, 223, 224, 225, 
226-230 

TATA-less, 18, 227 


Protein(s) 


analysis of fossil, see Gene(s), 
fossil 

classification by invention 
period, 140 

databases, 129-130 

degradation, 178 

displaying domain 
duplication, 352 

domains, 124, 125, 126-128, 
129, 352, 424 

domains, fusion of, 398-401

elongation, 335, 336, 358, 
364, 367 

folds, 128, 129 

function and evolutionary 
conservation, 298 

fusion, 402-403 

highly conserved, 300 

homology between 
orthologous human and 
rodent, 67 

isoforms, 14-15, 21, 149, 159 

isoforms, 
compartmentalized, 349 

isoforms, tissue-specific, 112, 
349 

isozymes, 349 

modular structure, 129 

module transfer, see Exon 
shuffling 

modules, 124, 129, 424 

molecular modelling, 424, 
428-436 

mosaic, 124, 130 

mutations, see Amino acid 
substitutions 


phylogenetic reconstruction, 
see Phylogenetic 
reconstruction 

polymorphisms, 85 

polyproteins, 26, 179 

rapidly evolving, 300, 367 

secondary structural 
elements, 128 

sequence conservation, 67, 
129 

sequence conserved regions, 
310 

structurally conserved 
regions, 310 

size, 9 

structural conservation, 129 

trifunctional, 25 

truncation, 115 


Protein C, evolution, 126, 127,
431-436


Protein kinase C superfamily,
190


Pseudoautosomal region(s) 


addition-attrition 
hypothesis, 80-81 

boundaries, 80 

boundary movement during 
evolution, 80 

evolution, 80 

genes, 80-81, 176, 394 

model for boundary 
formation, 80 

obligate recombination in, 5 

paralogous segment, 57 

size, 77 


Pseudoexons 


deletion within, 130 

human genes, 130-131 
mechanisms of creation, 130 
recombinational use, 130 


Pseudogenes 


as mediators of 
recombination, 265 
definition, 17 
mitochondrial genome 
derived, 82 
pericentromeric region 
directed, 269 
recruitment as promoter 
elements, 274 
sub-telomeric region 
directed, 351 
transcribed, 17, 161, 269, 
270 
Y chromosome, see Y 
chromosome 


Pseudogenes, duplicational, 17,
265-269
age, 275-276, 278 
duplicated processed 
pseudogenes, 269, 274 
gene conversion, see Gene 
conversion 


genes with high frequency 
of, 183, 193 
inactivating mutations, 17, 
265, 268, 269, 298 
mediating recombination 
events, see Recombination 
partial, see Relics, gene 
patterns of mutation, 276, 
298, 304, 306 
presence/absence 
polymorphism, 269 
primate-specific, 265 
reactivation of, 265, 278-279 
single copy, 17, 280 
transcribed, 17, 161, 266, 
267, 268, 269 
truncated, 171, 242 
Pseudogene(s), hybrid, 275 
Pseudogene(s), processed, 17, 
269-274 
antisense transcript derived, 
270 
characteristics, 269 
co-retrotransposition with 
other elements, 274 
copy numbers, 270 
gene conversion, see Gene 
conversion 
genes with high frequency, 
273 
housekeeping gene-derived, 
270 
insertion sites, 274 
intron-inserted, 274 
mutation rates, inter-specific 
comparisons, 276 
presence/absence 
polymorphism, 274 
primate-specific, 275-276 
promoter-inserted, 274 
semi-processed, 272, 275 
structure, 17 
suppressor element 
contributing, 274 
tissue-specific gene derived, 
270 
transcribed, 17, 270 
truncated, 269, 273 
with functional promoter 
elements, 270 
Pufferfish, see Fugu rubripes 
Pulmonary surfactant, 142 
Punctuated equilibrium, 113 


Rare allele advantage, see 
Selection, frequency- 
dependent 

Receptor promiscuity, 199 

Recombination, see also 
Unequal crossing over 

between pseudoautosomal 
region boundary 
sequences, 57, 80 


causing gene fusion, 398-401 
chiasmata, see Chiasma(ta) 
coldspots, 5, 405 
frequency by chromosomal 
band, 5 
gene inactivating, 283, 405 
haplotype generation, 13 
homologous unequal, 329, 
330, 335, 345, 350, 403-406 
hotspots, 5, 405 
in exon shuffling, see Exon 
shuffling 
intergenic, 355 
intragenic, 355 
intron-mediated, 109, 156, 
399 
involving pseudoexons, 130 
mediated by Alu sequences, 
see Alu sequences 
mediated by LINE elements, 
see LINE elements 
mediating gene deletion, see 
Deletions, gross 
mediating gene duplication, 
see Duplications, gene 
mediating gene inversion, see 
Inversions, chromosomal 
mediating promoter 
replacement, 404 
non-homologous, 345 
obligate in pseudoautosomal 
region, 5, 77 
proteins, 145, 148 
pseudogene-mediated, 265 
sequences inhibiting, 5, 405 
sequences promoting, 5, 
390, 394, 395, 404, see also 
Chi recombination 
element 
sex differences in, 5 
suppression, 79, 81 
transcription enhancement 
of, 5 
V(D)J, see V(D)J 
recombination 
Recombination-activating 
genes, 107, 339, 406-407 
Recombination signal 
sequences, see V(D)J 
recombination 
Recurrent mutation, see 
Mutation, recurrent 
Relics, gene (partial 
pseudogenes), 171, 195, 
196, 268-269, 350 
Remnants, see Relics, gene 
Repetitive DNA, 30-34 
Replication slippage, 253, 361, 
365 
Replicons, 7 
Repressors, 19-20, 21, 110-111,
223, 230, 236, 240, see also
Silencers


Response elements, see 
Promoter(s), gene 
Retroelements, see Retroviral 
sequences, endogenous 
Retroposons, see Retroviral 
sequences, endogenous 
Retrotransposition 
generation of intronless 
genes by, 411 
generation of variation 
through imprecision, 337 
resulting in gene creation, 
178, 338, 403, 412 
Retroviral sequences, 
endogenous 
as promoters of gene 
conversion, 411 
as promoters of 
transcriptional fusion, 243 
evolution, 336-339 
families, 34, 336 
in gene promoter regions, 
241-244 
in gene coding regions, 34, 
337, 399 
origins, 337, 338 
recruitment as promoter 
elements, 34, 339, 399 
types, 34 
Y chromosome, 79 
Reverse transcriptase 
endogenous, 274 
LINE element-derived, 274 
Reverse transcription, 8, 274 
Ribosomal protein genes, 148, 
179, 247, 248, 273 
Ribosomal RNA genes, see 
Gene Families, Ribosomal 
RNA 
Ribotypes, 115 
RNA-binding proteins, 22, 
175-176, 177, 197 
RNA polymerases, 3, 19, 32, 34, 
181 
genes, 147, 221 
readthrough, 240, 402 
RNA 7SL, 32, 340 
RNA processing 
importance in evolution, 115 
RNases, evolution of, 423 
Rumination, evolution of, 423 


Saccharomyces cerevisiae (yeast) 
fate of duplicated genes in, 61 
genes, 89, 145 
genome database, 89 
genome duplication, 62 
genome sequencing, 89, 145 
orthologues in human
genome, 145

Satellite DNA 
alphoid, 4, 5, 31, 360, 396 
beta, 396 
centromeric, 4, 31, 338, 360
concerted evolution, 4 
evolution, 4, 338 
megasatellite, 31 
microsatellite, see 
Microsatellite DNA 
minisatellite, see 
Minisatellite DNA 
Scaffold attachment regions, see 
Matrix attachment 
regions 
Segregation distortion, 364 
Selection, frequency- 
dependent, 174 
Selection, negative 
genes evolving under, 12, 
174, 185, 297, 300, 316 
Selection, positive, 297, 298, 
300, 316 
amino acid substitutions 
resulting from, 316-319 
antigen recognition site, 
HLA, 174 
DNA-binding specificity of 
steroid receptors, 253, 
318-319 
genes evolving under, 300 
immunoglobulins, 195 
involucrin, 372 
lysozyme, 300-301 
olfactory receptor-ligand 
interactions, 319 
post-duplicational, 58, 61 
serpins, bait loop of, 192 
synaptotagmins, 319 
visual pigment genes, 
316-318 
Selection, purifying, see
Selection, negative 
Serine protease inhibitors 
(Serpins), 191-192, 426 
Serine proteases, 125-128, 
190-191, 197, 307-310, 
424-436 
Sex chromosome(s) 
divergence, 77-79 
evolution, 77-81, 82, 83 
homology between, 77 
Sex determining gene SRY, 78, 
79, 80, 107, 121, 300 
Short interspersed repeat 
(SINE) elements 
mariner elements, 5, 33, 108, 
405 
MIR elements, 33, 245 
Alu sequences, see Alu 
sequences 
Tigger elements, 33, 338 
Silencers, 19-20, 240 
Alu sequences as, 240 
Simple sequence repeats, 31 
Single base-pair substitution(s) 
in evolution 
affecting mRNA stability,
298 
coding region, 298, see also 
Selection, positive 
influencing polyadenylation 
site usage, 234 
promoter, see Promoter(s) 
gene 
splicing relevant, 164, 283, 
320-322 
Single base-pair substitution(s) 
in pathology, see 
Mutations, pathological 
Sister chromatid exchanges, 
182, 410 
Small nuclear RNA genes, 
182-183 
Small nuclear (sn) RNAs, 23, 
25, 273, 274 
Small ribonucleoprotein 
particles, 25 
Sodium channel genes, 23, 347 
Spectrins, 356 
Splice-mediated insertion, see 
Alu sequences 
Spliceosome, 25 
Splicing, fusion, see Fusion 
splicing 
Splicing, mRNA, see mRNA
splicing 
Splicing, alternative, 9, 23, 
111-115 
conservation in orthologous 
genes, 112-113 
database, 114 
developmental switch in, 
112 
differences between 
orthologous genes, 
112-114, 321, 403 
differences between 
paralogous genes, 114, 
320, 322 
evolutionary advantages, 
111-112 
involving Alu sequence 
cassettes 
potentiated by intragenic 
duplication, 355 
recent evolutionary origin, 
113 
redundancy after gene 
duplication, 114 
regulation of, 23 
resulting from 3’ UTR SINE
insertion, 113, 339 
resulting from alternative 
polyadenylation site 
usage, 114, 123 
resulting from alternative 
promoter usage, 114 
splicing enhancers, role of, 
24-25 


transcription factor genes, 
253 
types, 112 


Splicing enhancers, 24-25, 321 
Splicing mutations 


gene inactivating in 
evolution, 164, 283, 
320-322 

gene inactivating in 
pathology, 35, 36-37, 
319-320 

in orthologous genes, 321 

in paralogous genes, 321 

phenotypic consequences of, 
320 


STAT proteins, 144 
Steroid receptors 


DNA-binding specificity, 
226 


Sub-telomeric regions 


as “nurseries” for genetic
diversity, 193, 351 

inter-chromosomal 
exchange, 396-397 

spreading, 396 


Sulfatases, 176 
Susceptibility to infectious
disease, 281, 282, 283, 284


Synteny 


evolutionary conservation of, 
14, 56, 65, 66, 196, 348 

genes encoding different 
subunits of a heteromeric 
protein, 15-16 

genes encoding enzymes 
catalysing successive steps 
in metabolic pathway, 15 

genes encoding isozymes 
targeted to specific 
subcellular compartments, 
15 

genes encoding ligands and 
their receptors, 16 

genes encoding same 
product, 14 

genes encoding tissue- 
specific isoforms, 14-15 

genes of similar function and 
common evolutionary 
origin, 16 

human-mouse, 65 

utility in gene mapping 
studies/isolation, 63 

X chromosome, see X 
chromosome 


Tandem repeats, 31 
TATA box, 18, 221, 226, 227 
T cell receptor genes, 10, 16,
196, 321, 346


Telomerase, 5 
Telomere(s), 4
chimpanzee, 5
chromosomal bands, 6
function, 4 
fusion, 72-73 
hexanucleotide repeats, 4, 
360 
length polymorphism, 5, 360 
length variation, 5 
proliferation of multigene 
families near, 351 
satellite DNA, 31 
sequence exchange between 
non-homologous 
chromosomes, 360 
sequence exchange with 
centromeres, 396 
sequence promoting meiotic 
recombination, 404 
sequence reorganization 
during primate evolution, 
360 
Termination codon 
introduction of novel, 164 
Tetraploidization, see 
Duplication(s), genome 
Topoisomerase cleavage sites 
association with deletion 
breakpoints, 331 
association with duplication 
breakpoints, 345 
association with matrix 
attachment regions, 7 
non-homologous 
recombination and, 6 
Trade-off theory, 304 
Trans-splicing, 403 
Transcription, see mRNA 
transcription 
Transcriptional activation 
evolutionary conservation, 
221 
Transcription-coupled repair, 3, 
306, 314, 315 
Transcription factor(s), 20-21, 
180, 221 
alternative splicing, 253 
binding specificities, 19 
CREB/ATF family, 230, 250 
CTF/NF-1, 250 
developmental regulatory, 
247, 249 
domains, 20, 21 
Ets family, 144, 226-227, 
236, 245, 250, 251 
evolutionary conservation, 
251 
families, 20 
functional conservation, 251 
functional redundancy, 
251-252, 254 
GATA-1, 21, 224, 229 
HMG family, 146, 221, 250 
interactions with cis 
regulatory motifs, 254 


homeobox, 20, 165-168 
Jun, 107, 144, 145 
MADS box, 147, 275 
mammalian-specific, 251 
MOR2, 193, 253 
MyoD family, 251 
NF-1 family, 224, 250
NFkB, 225, 229, 245 
NF-Y, 146 
nuclear receptor family, 
189-190, 252-253 
orthologous, 253 
paralogous, 252-253 
phylogeny, 250 
Pit-1, 21, 227, 251, 254 
POU domain family, 107, 
250, 254 
primate-specific, 251 
relationship with histones, 
221 
Snail family, 249 
SP1, 18, 21, 224, 226, 
227-228, 238, 240, 245 
T-box family, 254 
TATA box-binding factor, 
TBP (TFIIB), 147, 251 
TCF/SOX family, 249, 250 
TFIIE, 147 
transcriptional fusion, 243 
Y-box, 147 
zinc finger proteins, 20, 
192-193, see also Zinc 
finger protein genes 
Transcriptional initiation 
site(s), 17, 18, 21 
differential usage between 
orthologous genes, 235, 
241 
differential usage between 
paralogous genes, 227 
evolutionary conservation, 
227 
in vitro usage, 234-235 
in vivo usage, 235 
provision by Alu sequence, 
241 
Transduction, see Long 
interspersed repeat 
(LINE) element(s) 
Translation, see mRNA 
translation 
Translational initiation codons, 
21 
alternative use, 21 
differences between 
orthologous genes, 358 
relative frequency of usage, 
21 
internal, as fossils of gene 
fusion, 402 
mutations, evolutionary, 
358 
mutations, pathological, 357 


Translation initiation factors, 
147 
Translational stop codons, 21 
relative frequency of usage, 
21 
protein elongation 
consequent to 
evolutionary loss, 358 
Translocation(s), 392-397 
autosome to X-Y, 394 
chromosomal, see 
Chromosomal evolution, 
translocations 
followed by duplication, 
347-348 
high frequency involvement 
of specific chromosomes, 
397 
hotspots in pathology and 
evolution, 397 
human-specific, 396 
inter-chromosomal, 160, 195, 
392-394, 395-396 
intra-chromosomal, 394 
pericentromeric-directed, 
269, 394-396 
post-duplicational, 347, 348 
reciprocal, 73-74 
repetitive sequences 
involved in mediating, 77, 
195, 394, 395, 396 
Transposable elements, 33, 34 
containing recombination- 
activating genes, 339 
mariner, 5, 33, 108, 405 
MIR, 33, 245 
reduction in gene conversion 
consequent to insertion 
of, 411 
retroelements, 34, 336-337 
role in recombination, 5 
THE1 elements, 34, 108, 338
transposons, 34, 336, 337, 
338 
Transposases 
centromeric protein CENP- 
B, 338 
pogo, 33, 338 
recruitment to cellular 
function, 34, 338 
Tigger, 33, 338 
Tramp, 338 
Transposition, see Translocation 
Transposons, see Transposable 
elements 
Triplet repeats 
conservation of orthologous 
sequences, 364-365 
evolution of repeat number, 
363-366 
frequency by type, 38, 363 
genes containing, 38, 360, 
363 
length-dependent
mutational bias, 361, 364 
origin of expanded, 363-364 
polymorphic, 38, 360-363 
replication slippage, 253, 
361, 365 
species-specific rates of 
expansion, 364-365 
stability, factors influencing, 
361 
Triplet repeat expansion, 38 
disorders of, 38, 360-364 
dynamic mutation, 38 
evolution, 363-364
involucrin, see Involucrin 
involving promoter motifs, 
253 
role of DNA secondary 
structure in, 361 
sequence types, 360-363 
tRNA genes, 177-178, 304, 
393-394 
Tubulins, 118, 148 


Ubiquitins, 26, 144, 178-179, 
401, 410 
Unequal crossing over, 150, 
158, 162, 164, 165, 169, 
179, 181, 182, 187, 352, 
355, 365, 396, 399, 400, 
402 
Untranslated region 3’, 8, 
21-22, 116 
evolutionary conservation, 
233 
highly conserved regions, 
233 
incorporation into coding 
regions, 336 
insertion of Alu sequences 
into, see Alu sequences 
insertion of SINE element, 
339 
internal duplication, 355 
length difference between 
orthologous genes, 234 
length difference between 
paralogous genes, 154, 234 
single base-pair 
substitutions, 35, 298 
Untranslated region 5’, 8, 21, 
116 
evolutionary conservation, 
233 
highly conserved regions, 233 
insertion of Alu sequences, 
see Alu sequences 
length differences between 
paralogous genes, 154 
micro-deletion, 234 
single base-pair 
substitutions, 35, 298 
sizes, human genes, 116


V(D)J recombination, 406-407
involvement in intron 
excision, 122 
recombination signal 
sequences, 122, 405, 406 
Vertebrate evolution 
increase in gene numbers, 
427 
molecular clocks, see 
Molecular clocks 
timescale, 63, 64 
Visual pigment evolution, 184, 
351, 401-402, 408 
Old World monkeys, 
316-317 
New World monkeys, 
317-318 
Vitamin K-dependent factors, 
see also Serine proteases 
evolution, 125-125, 142,
424-436 
phylogenetic reconstruction, 
425-426 


X chromosome 
conserved region, 77, 78 
evolution, 77, 78 
genes with functional 
homologues on Y, 78 


inversions, 78 
mutation rates, see Mutation 
rates 
recently added region 
(XRA), 77, 78 
recombinational exchange 
with Y chromosome, 77 
regional duplications, 77 
synteny, conservation of, 78 
translocations, inter- 
chromosomal, 72, 77, 390 
X inactivation 
evolution, 81 
evolutionary spreading, 81 
genes escaping, 5, 78, 79, 81 
role of DNA methylation in, 
313 
XIST gene, role of in, 30, 83 
Xenobiotic metabolism, 
183-184, 330, 335, 398 


Y chromosome 

active genes, 78, 79 

gene acquisition by 
translocation, 77, 80 

gene amplification, 79 

gene loss from, 79 

genes with homologues on 
X, 78, 79, 176 


heteromorphism, 76 

inter-individual variation in 
DNA content, 76 

inversion, human-specific, 
72, 390 

mutation rate, 303 

non-recombining portion, 
78 

nucleotide substitution rate, 
79 

polymorphism, 79, 85 

pseudogenes, 78 

retroviral insertion, 79 

testis-expressed genes, 79 

ubiquitously expressed 
genes, 79 

Yeast, see Saccharomyces
cerevisiae


Z DNA, 5 
Zebrafish (Danio rerio), 247 
gene mapping, 56-57 

Zinc finger protein genes, 79, 
141, 145, 192-193, 243, 
251, 347, 358, 389-390, 

ZOO-FISH, see Chromosomal 
painting 


